Notes on Tagger Integration


MetaMap Transfer
(MMTx)

These notes describe how to make a part-of-speech tagger client available to MMTx. Such a tagger can be used to resolve syntactic ambiguities during sentence parsing. The notes describe the data structures used for parsing and give details of the tagging process, with reference to both the Xerox PARC and Brill part-of-speech taggers.

MMTx Data Structures

An MMTx Sentence is a Java object that consists of phrases, lexical elements and tokens:

[Figure: a Sentence contains Phrases, lexicalElements and Tokens]

Tokens are the primitive components of Sentences. They consist of sequences of alphanumeric characters broken by punctuation or whitespace. Individual punctuation marks become separate tokens, but whitespace does not. Each Token has the following attributes: an identifier, a part-of-speech tag, the character span relative to the whole document, a flag indicating whether the token is punctuation, and the String associated with the token. There are methods to retrieve each of these pieces of information. Sentence objects have a method, getTokens(), that returns their Tokens as a Vector, which can be used to iterate over the Sentence.
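The Token description above can be illustrated with a small, self-contained sketch. SketchToken and SketchSentence are hypothetical stand-ins written for these notes, not the actual MMTx classes, and the field names are assumptions; only getTokens() and getOriginalString() are names taken from the text. Spans here are relative to the string passed in, whereas MMTx records spans relative to the whole document.

```java
import java.util.Vector;

// Entry point: tokenize a sentence and walk its Tokens.
class TokenWalkDemo {
    public static void main(String[] args) {
        SketchSentence sentence = SketchSentence.tokenize("This is a test.");
        for (SketchToken t : sentence.getTokens()) {
            System.out.println(t.id + " [" + t.begin + "," + t.end + ") '" + t.text + "'"
                    + (t.punctuation ? " punctuation" : ""));
        }
    }
}

// Hypothetical stand-in for an MMTx Token; the real class and accessors may differ.
class SketchToken {
    final int id;              // identifier
    final int begin;           // span start (inclusive)
    final int end;             // span end (exclusive)
    final boolean punctuation; // true for a single punctuation mark
    final String text;         // the String associated with the token
    String posTag;             // part-of-speech tag, null until a tagger sets it

    SketchToken(int id, int begin, int end, boolean punctuation, String text) {
        this.id = id;
        this.begin = begin;
        this.end = end;
        this.punctuation = punctuation;
        this.text = text;
    }
}

// Hypothetical stand-in for an MMTx Sentence, reduced to its Tokens.
class SketchSentence {
    private final String original;
    private final Vector<SketchToken> tokens = new Vector<>();

    private SketchSentence(String original) {
        this.original = original;
    }

    // Alphanumeric runs become tokens, each punctuation mark becomes its own
    // token, and whitespace is skipped.
    static SketchSentence tokenize(String text) {
        SketchSentence s = new SketchSentence(text);
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            boolean alnum = Character.isLetterOrDigit(c);
            if (alnum) {
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
            } else {
                i++;
            }
            s.tokens.add(new SketchToken(s.tokens.size(), start, i, !alnum,
                    text.substring(start, i)));
        }
        return s;
    }

    String getOriginalString() {
        return original;
    }

    Vector<SketchToken> getTokens() {
        return tokens;
    }
}
```

The tokenizer in this sketch is deliberately naive; it exists only to make the attribute list concrete.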

Tagging

Given the above data structures, the task of tagging a Sentence consists of:

  • sending the String associated with the Sentence to a tagger;
  • reading the tagged tokens produced by the tagger; and
  • aligning each tagged token with the Sentence Tokens, adding the tags to the Tokens.

We chose to have the tagger client be an instantiated class, which allows us to do some initialization such as opening a socket to a tagger server. (Other implementations could open and initialize an external process.)

Our tag(Sentence pSentence) method takes the Sentence object described above and does the following:

  • it sends the sentence text (obtained from the getOriginalString() method) to the tagger server; and
  • it reads the output from the server.
(Other implementations will do something similar.)
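A minimal sketch of such a client follows. It is not the MMTx implementation: the class name TaggerClientSketch, the methods connect and tagRaw, and the wire protocol are all assumptions made for illustration. In particular, the sketch assumes that the server accepts one sentence per line and that a reply ends with a line containing only `].`, matching the wrapper output format shown in these notes. The constructor accepts any Reader and Writer, so the same code can be exercised without a running server.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Entry point: exercise the client against a canned reply instead of a server.
class TaggerClientDemo {
    public static void main(String[] args) {
        String canned = "[\n['this', 'pron'],\n['.', 'pd']\n].\n";
        StringWriter sent = new StringWriter();
        TaggerClientSketch client = new TaggerClientSketch(new StringReader(canned), sent);
        System.out.print(client.tagRaw("this."));
    }
}

// Hypothetical tagger client; names and protocol are assumptions.
class TaggerClientSketch {
    private final BufferedReader in;
    private final Writer out;

    TaggerClientSketch(Reader in, Writer out) {
        this.in = new BufferedReader(in);
        this.out = out;
    }

    // Initialization step: open a socket to a tagger server.
    // A fuller version would also keep the Socket so it can be closed.
    static TaggerClientSketch connect(String host, int port) throws IOException {
        Socket socket = new Socket(host, port);
        return new TaggerClientSketch(
                new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8),
                new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8));
    }

    // Sends the sentence text and returns the raw reply, read up to and
    // including the assumed terminator line "].".
    String tagRaw(String sentenceText) {
        try {
            out.write(sentenceText);
            out.write('\n');
            out.flush();
            StringBuilder reply = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                reply.append(line).append('\n');
                if (line.trim().equals("].")) {
                    break;
                }
            }
            return reply.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An implementation that starts an external tagger process instead of opening a socket could reuse the same constructor by passing the process's output and input streams.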

We used the Xerox tagger with our SPECIALIST lexicon (trivially transformed to the appropriate format) and trained the tagger using a set of MEDLINE citations. We also wrote a simple wrapper for the tagger that produces output of the form:

         [
         ['this', 'pron'],
         ['is', 'aux'],
         ['a', 'det'],
         ['test', 'noun'],
         ['.', 'pd']
         ].
         
Each tag is then added to the corresponding Token of the Sentence, with a check that the token strings stay aligned throughout the process.
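The parsing and alignment step can be sketched as follows. The class TagAligner and its methods are hypothetical helpers, not part of MMTx. The sketch assumes one token/tag pair per line in the format shown above, and it compares token strings case-insensitively, which is an assumption about how the tagger normalizes its output. Any mismatch in count or text is treated as an alignment failure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Entry point: parse the sample reply and align it with the sentence tokens.
class AlignDemo {
    public static void main(String[] args) {
        String reply = String.join("\n",
                "[", "['this', 'pron'],", "['is', 'aux'],", "['a', 'det'],",
                "['test', 'noun'],", "['.', 'pd']", "].");
        List<String> tokens = Arrays.asList("This", "is", "a", "test", ".");
        List<String> tags = TagAligner.align(tokens, TagAligner.parse(reply));
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(tokens.get(i) + "\t" + tags.get(i));
        }
    }
}

// Hypothetical helper for reading wrapper output and aligning it with Tokens.
class TagAligner {
    // Matches one line such as ['test', 'noun'], with an optional trailing comma.
    private static final Pattern PAIR =
            Pattern.compile("^\\['(.*)',\\s*'([^']*)'\\],?$");

    // Returns {tokenString, tag} pairs; bracket-only lines are skipped.
    static List<String[]> parse(String reply) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : reply.split("\\R")) {
            Matcher m = PAIR.matcher(line.trim());
            if (m.matches()) {
                pairs.add(new String[] {m.group(1), m.group(2)});
            }
        }
        return pairs;
    }

    // Returns one tag per sentence token, or fails if the two sequences diverge.
    static List<String> align(List<String> tokenStrings, List<String[]> tagged) {
        if (tokenStrings.size() != tagged.size()) {
            throw new IllegalStateException("token count mismatch: "
                    + tokenStrings.size() + " vs " + tagged.size());
        }
        List<String> tags = new ArrayList<>();
        for (int i = 0; i < tokenStrings.size(); i++) {
            String expected = tokenStrings.get(i);
            String actual = tagged.get(i)[0];
            if (!expected.equalsIgnoreCase(actual)) {
                throw new IllegalStateException("misaligned at " + i + ": '"
                        + expected + "' vs '" + actual + "'");
            }
            tags.add(tagged.get(i)[1]);
        }
        return tags;
    }
}
```

Failing fast on a mismatch is a design choice of this sketch; a real client might instead try to resynchronize when the tagger tokenizes differently from MMTx.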

Note that tagging is done at the most primitive level of tokenization. Primitive tokens make up larger units, the lexicalElements mentioned above, which are created by matching (possibly multi-word) entries in the SPECIALIST lexicon. lexicalElements inherit tags from their Tokens according to the following convention: if a lexicalElement has only one Token, use that Token's tag. If it has multiple Tokens, use the tag from the first lexical entry, if any; otherwise, use the tag from the rightmost Token.
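The inheritance convention can be written as a short function. This is a sketch of the rule as stated in these notes, not MMTx code: the class LexicalTagRule and the method inheritTag are hypothetical, and the tags in the example are illustrative only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper expressing the tag inheritance convention.
class LexicalTagRule {
    // tokenTags: tags of the element's Tokens, in left-to-right order.
    // lexiconEntryTags: tags of the matching lexicon entries, possibly empty.
    static String inheritTag(List<String> tokenTags, List<String> lexiconEntryTags) {
        if (tokenTags.isEmpty()) {
            throw new IllegalArgumentException("a lexical element needs at least one token");
        }
        if (tokenTags.size() == 1) {
            return tokenTags.get(0);                 // single Token: use its tag
        }
        if (!lexiconEntryTags.isEmpty()) {
            return lexiconEntryTags.get(0);          // first lexical entry, if any
        }
        return tokenTags.get(tokenTags.size() - 1);  // otherwise the rightmost Token
    }

    public static void main(String[] args) {
        // Illustrative tags only; not taken from the SPECIALIST lexicon.
        System.out.println(inheritTag(Arrays.asList("noun"),
                Collections.<String>emptyList()));
        System.out.println(inheritTag(Arrays.asList("adj", "noun"),
                Arrays.asList("noun")));
        System.out.println(inheritTag(Arrays.asList("adj", "verb"),
                Collections.<String>emptyList()));
    }
}
```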

Considerations for the Brill Tagger

If the Brill tagger were to be employed in MMTx, some issues would need to be resolved. Besides some straightforward formatting considerations, the most important issues arise from tag set differences.

There is no one-to-one mapping between the Penn Treebank tags used by the Brill tagger and the MMTx tags used by our parser. (The MMTx tags are based on the tag set of our SPECIALIST lexicon.) In particular, the Penn Treebank tags IN (Preposition or subordinating conjunction), TO (to) and PDT (Predeterminer) map to more than one SPECIALIST tag, and the tags EX (Existential there) and POS (Possessive ending) convey more information than their SPECIALIST counterparts. Finally, the Penn Treebank tags FW (Foreign word), LS (List marker), SYM (Symbol) and UH (Interjection) have no SPECIALIST analogs at all. This last class of differences causes no difficulty, however, because these tags are not relevant to MMTx processing.

A first attempt at mapping Penn Treebank tags to our tags is shown in the table at the end of these notes. In this approximation of the Penn Treebank tags, IN and TO are mapped to PREP, PDT is mapped to DETERMINER, EX to ADVERB, POS to NOUN, and FW, LS, SYM and UH to UNKNOWN. In each case, the mapping fails to completely represent the Penn Treebank tag's meaning.

To improve the fidelity of the mapping, modifications to the SPECIALIST tag set could be made, especially for the EX and POS tags. This would, of course, entail corresponding modifications to the MMTx parser.

Another solution would be to modify the Brill tagger to use the SPECIALIST tag set. Note that the Fast Transformation-Based Learning (TBL) Toolkit developed by the NLP group at Johns Hopkins University provides exactly what is needed here: an efficient implementation of the Brill tagger using arbitrary tag sets. However, this approach would require a corpus tagged with the SPECIALIST tag set for training the tagger. A transformation of the Penn Treebank could be constructed, or another corpus created using the Xerox tagger could form the basis of a bootstrapping approach to the problem.

Mapping from the Penn Treebank Tags to MMTx/SPECIALIST Tags

| # | Penn Treebank Tag | MMTx Tag | SPECIALIST Tag | Notes |
|---|---|---|---|---|
| 1 | CC Coordinating conjunction | CONJUNCTION | conj | |
| 2 | CD Cardinal number | NUMBER | -- | Maps as a shape. |
| 3 | DT Determiner | DETERMINER | det | |
| 4 | EX Existential there | ADVERB | adv | In a future version, we may alter the SPECIALIST Lexicon's tag set and the parser to better handle this tag. |
| 5 | FW Foreign word | UNKNOWN | -- | |
| 6 | IN Preposition or subordinating conjunction | PREP | prep | In a future version, we may alter the SPECIALIST Lexicon's tag set and the parser to better handle this tag. |
| 7 | JJ Adjective | ADJECTIVE | adj | |
| 8 | JJR Adjective, comparative | ADJECTIVE | adj | |
| 9 | JJS Adjective, superlative | ADJECTIVE | adj | |
| 10 | LS List item marker | UNKNOWN | -- | |
| 11 | MD Modal | MODAL | modal | |
| 12 | NN Noun, singular or mass | NOUN | noun | |
| 13 | NNS Noun, plural | NOUN | noun | |
| 14 | NNP Proper noun, singular | NOUN | noun | |
| 15 | NNPS Proper noun, plural | NOUN | noun | |
| 16 | PDT Predeterminer | DETERMINER | det | all, both, many, such are marked as dets; half, rather are marked as advs. In a future version, we may alter the SPECIALIST Lexicon's tag set and the parser to better handle this tag. |
| 17 | POS Possessive ending | NOUN | -- | In a future version, we may alter the SPECIALIST Lexicon's tag set and the parser to better handle this tag. |
| 18 | PRP Personal pronoun | PRONOUN | pron | |
| 19 | PRP$ Possessive pronoun | PRONOUN | pron | |
| 20 | RB Adverb | ADVERB | adv | |
| 21 | RBR Adverb, comparative | ADVERB | adv | |
| 22 | RBS Adverb, superlative | ADVERB | adv | |
| 23 | RP Particle | ADVERB | adv | |
| 24 | SYM Symbol | UNKNOWN | -- | |
| 25 | TO to | PREP | prep | In a future version, we may alter the SPECIALIST Lexicon's tag set and the parser to better handle this tag. |
| 26 | UH Interjection | UNKNOWN | -- | |
| 27 | VB Verb, base form | VERB | verb | |
| 28 | VBD Verb, past tense | VERB | verb | |
| 29 | VBG Verb, gerund or present participle | VERB | verb | We may want to mark the inflection as present_participle. |
| 30 | VBN Verb, past participle | VERB | verb | |
| 31 | VBP Verb, non-3rd person singular present | VERB | verb | |
| 32 | VBZ Verb, 3rd person singular present | VERB | verb | |
| 33 | WDT Wh-determiner | DETERMINER | det | |
| 34 | WP Wh-pronoun | PRONOUN | pron | |
| 35 | WP$ Possessive wh-pronoun | PRONOUN | pron | |
| 36 | WRB Wh-adverb | ADVERB | adv | |
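
If the Brill tagger were wired in, the first-attempt mapping in the table above could be applied as a simple lookup. The sketch below copies the Penn Treebank to MMTx column pairs from the table; the class PennToMmtx and its methods are hypothetical. Returning UNKNOWN for tags that the table does not list (such as Penn Treebank punctuation tags) is an assumption of this sketch, not something these notes specify.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup built from the mapping table in these notes.
class PennToMmtx {
    private static final Map<String, String> TABLE = new HashMap<>();

    // Registers several Penn Treebank tags under one MMTx tag.
    private static void map(String mmtxTag, String... pennTags) {
        for (String penn : pennTags) {
            TABLE.put(penn, mmtxTag);
        }
    }

    static {
        map("CONJUNCTION", "CC");
        map("NUMBER", "CD");
        map("DETERMINER", "DT", "PDT", "WDT");
        map("ADVERB", "EX", "RB", "RBR", "RBS", "RP", "WRB");
        map("UNKNOWN", "FW", "LS", "SYM", "UH");
        map("PREP", "IN", "TO");
        map("ADJECTIVE", "JJ", "JJR", "JJS");
        map("MODAL", "MD");
        map("NOUN", "NN", "NNS", "NNP", "NNPS", "POS");
        map("PRONOUN", "PRP", "PRP$", "WP", "WP$");
        map("VERB", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ");
    }

    // Returns the MMTx tag for a Penn Treebank tag.
    // Assumption: tags missing from the table fall back to UNKNOWN.
    static String toMmtx(String pennTag) {
        String mmtxTag = TABLE.get(pennTag);
        return mmtxTag != null ? mmtxTag : "UNKNOWN";
    }

    // Number of Penn Treebank tags covered by the table.
    static int size() {
        return TABLE.size();
    }

    public static void main(String[] args) {
        for (String penn : new String[] {"IN", "TO", "PDT", "EX", "POS", "NNS"}) {
            System.out.println(penn + " -> " + toMmtx(penn));
        }
    }
}
```

Because the mapping is many-to-one, it cannot be inverted, and distinctions such as EX versus RB are lost once a tag has been converted.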


Last Modified: March 30, 2007