Notes on Tagger Integration
The purpose of these notes is to describe how to make a part-of-speech tagger client available to MMTx. Such a tagger can be used to resolve syntactic ambiguities during sentence parsing. The notes consist of a description of the data structures used for parsing and details of the tagging process, with references to both the Xerox PARC and Brill part-of-speech taggers.

MMTx Data Structures

An MMTx Sentence is a Java object that consists of phrases, lexical elements and tokens:
Tokens are the primitive components of Sentences. They consist of sequences of alphanumeric characters broken by punctuation or whitespace. Individual punctuation marks become separate tokens, but whitespace does not. Internally, Tokens have the following attributes: an identifier, a part-of-speech tag, the character span relative to the whole document, whether the token is punctuation, and the String associated with the token. There are methods to retrieve each of these pieces of information. Sentence objects have a method, getTokens(), to retrieve their set of Tokens. This set, which is actually a Vector, can be used to iterate over the Tokens of a Sentence.

Tagging

Given the above data structures, the task of tagging a Sentence consists of the steps described below.
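The Token and Sentence structures just described might be sketched as follows. Only getTokens() is named in the notes, so every other field and method name here is an assumption, not the actual MMTx API:

```java
import java.util.Vector;

// Illustrative sketch of the structures described above; names other than
// getTokens() are assumptions rather than the real MMTx classes.
class Token {
    private final int id;            // identifier
    private String tag;              // part-of-speech tag, assigned during tagging
    private final int spanBegin;     // character span relative to the whole document
    private final int spanEnd;
    private final boolean punctuation;
    private final String text;       // the String associated with the token

    Token(int id, int spanBegin, int spanEnd, boolean punctuation, String text) {
        this.id = id;
        this.spanBegin = spanBegin;
        this.spanEnd = spanEnd;
        this.punctuation = punctuation;
        this.text = text;
    }

    int getId() { return id; }
    String getTag() { return tag; }
    void setTag(String tag) { this.tag = tag; }
    int getSpanBegin() { return spanBegin; }
    int getSpanEnd() { return spanEnd; }
    boolean isPunctuation() { return punctuation; }
    String getText() { return text; }
}

class Sentence {
    // The set of Tokens is actually a Vector, which callers iterate over.
    private final Vector<Token> tokens = new Vector<>();

    void addToken(Token t) { tokens.add(t); }

    Vector<Token> getTokens() { return tokens; }
}
```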
We chose to have the tagger client be an instantiated class, which allows us to do some initialization such as opening a socket to a tagger server. (Other implementations could open and initialize an external process.) Our tag(Sentence pSentence) method takes the Sentence object described above, passes its token strings to the tagger, and adds the resulting tags to the Tokens.
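A minimal sketch of such a client is shown below, under two assumptions: the socket round trip is left out (the constructor marks where it would be opened), and the server reply uses the bracketed word/tag format produced by our wrapper, shown later in these notes. All names are illustrative, not the actual MMTx API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative tagger-client sketch; the real client would talk to the
// tagger server over the socket opened in the constructor.
class TaggerClient {
    // Matches one ['word', 'tag'] pair in the wrapper's reply.
    private static final Pattern PAIR =
        Pattern.compile("\\['([^']*)',\\s*'([^']*)'\\]");

    TaggerClient() {
        // Initialization happens once per client instance; this is where a
        // socket to the tagger server would be opened.
    }

    // Parses a server reply and returns one tag per token string, verifying
    // that the word strings in the reply stay aligned with the tokens.
    List<String> tag(List<String> tokenStrings, String serverReply) {
        List<String> tags = new ArrayList<>();
        Matcher m = PAIR.matcher(serverReply);
        int i = 0;
        while (m.find()) {
            if (i >= tokenStrings.size() || !m.group(1).equals(tokenStrings.get(i))) {
                throw new IllegalStateException("tagger output not aligned with tokens");
            }
            tags.add(m.group(2));
            i++;
        }
        if (i != tokenStrings.size()) {
            throw new IllegalStateException("tagger returned too few tags");
        }
        return tags;
    }
}
```

Abstracting the transport this way keeps the alignment check testable without a live tagger server; a socket-backed or process-backed client would only change where the reply string comes from.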
We used the Xerox tagger with our SPECIALIST lexicon (trivially transformed to the appropriate format) and trained the tagger on a set of MEDLINE citations. We also wrote a simple wrapper for the tagger that produces output of the form: [ ['this', 'pron'], ['is', 'aux'], ['a', 'det'], ['test', 'noun'], ['.', 'pd'] ]. The tags are simply added to each Token of the Sentence, making sure that the token strings remain aligned throughout the process. It is useful to note here that tagging is done at the most primitive level of tokenization. Primitive tokens make up larger units, the lexicalElements mentioned above, which are created by matching (possibly multi-word) entries in the SPECIALIST lexicon. lexicalElements inherit tags from their Tokens according to the following convention: if a lexicalElement has only one Token, use its tag; if it has multiple Tokens, use the tag from the first lexical entry, if any; otherwise, use the tag from the rightmost Token.

Considerations for the Brill Tagger

If the Brill tagger were to be employed in MMTx, some issues would need to be resolved. Besides some straightforward formatting considerations, the most important issues arise from tag set differences. There is no one-to-one mapping between the Penn Treebank tags used by the Brill tagger and the MMTx tags used by our parser. (The MMTx tags are based on the tag set of our SPECIALIST lexicon.) In particular, the Penn Treebank tags IN (Preposition or subordinating conjunction), TO (to) and PDT (Predeterminer) map to more than one SPECIALIST tag; and the tags EX (Existential there) and POS (Possessive ending) convey more information than their SPECIALIST counterparts. Finally, the Penn Treebank tags FW (Foreign word), LS (List marker), SYM (Symbol) and UH (Interjection) have no SPECIALIST analogs at all. Note, however, that this last class of differences causes no difficulty because these tags are not relevant to MMTx processing anyway.
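The lexicalElement tag-inheritance convention described earlier can be written out as a small helper. The names are illustrative; lexicalEntryTag stands for the tag of the matched SPECIALIST lexical entry, or null when there is none:

```java
import java.util.List;

// Illustrative helper for the lexicalElement tag-inheritance convention;
// not the actual MMTx implementation.
class LexicalElementTags {
    static String inheritTag(List<String> tokenTags, String lexicalEntryTag) {
        if (tokenTags.size() == 1) {
            return tokenTags.get(0);                 // single Token: use its tag
        }
        if (lexicalEntryTag != null) {
            return lexicalEntryTag;                  // multiple Tokens: tag of the first lexical entry, if any
        }
        return tokenTags.get(tokenTags.size() - 1);  // otherwise: tag of the rightmost Token
    }
}
```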
A first attempt at mapping Penn Treebank tags to our tags is shown in the table at the end of these notes. In this approximation, IN and TO are mapped to PREP, PDT to DETERMINER, EX to ADVERB, POS to NOUN, and FW, LS, SYM and UH to UNKNOWN. In each case, the mapping fails to completely represent the Penn Treebank tag's meaning. To improve the fidelity of the mapping, the SPECIALIST tag set could be modified, especially for the EX and POS tags. This would, of course, entail corresponding modifications to the MMTx parser. Another solution would be to modify the Brill tagger to use the SPECIALIST tag set. Note that the Fast Transformation-Based Learning (TBL) Toolkit developed by the NLP group at Johns Hopkins University provides exactly what is needed here: an efficient implementation of the Brill tagger that accommodates arbitrary tag sets. However, this approach would require a corpus tagged with the SPECIALIST tag set for training the tagger. Such a corpus could be constructed by transforming the Penn Treebank, or another corpus tagged with the Xerox tagger could form the basis of a bootstrapping approach to the problem.

Mapping from the Penn Treebank Tags to MMTx/SPECIALIST Tags
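The problematic part of this first-pass mapping can be written out as a lookup table. Only the Penn Treebank tags discussed in the prose are included here; the full table appears at the end of the notes:

```java
import java.util.Map;

// The first-approximation mapping described in the text, restricted to the
// Penn Treebank tags called out as problematic.
class PennToSpecialist {
    static final Map<String, String> MAP = Map.of(
        "IN",  "PREP",        // preposition or subordinating conjunction
        "TO",  "PREP",        // to
        "PDT", "DETERMINER",  // predeterminer
        "EX",  "ADVERB",      // existential there
        "POS", "NOUN",        // possessive ending
        "FW",  "UNKNOWN",     // foreign word
        "LS",  "UNKNOWN",     // list marker
        "SYM", "UNKNOWN",     // symbol
        "UH",  "UNKNOWN"      // interjection
    );
}
```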
Last Modified: March 30, 2007
Lister Hill National Center for Biomedical Communications | U.S. National Library of Medicine | National Institutes of Health
Department of Health and Human Services