Notes on Tagger Integration
The purpose of these notes is to describe how to make a part-of-speech tagger client available to MMTx. Such a tagger can be used to resolve syntactic ambiguities during sentence parsing. The notes consist of a description of the data structures used for parsing and details of the tagging process, with references to both the Xerox PARC and Brill part-of-speech taggers.

MMTx Data Structures

An MMTx Sentence is a Java object that consists of phrases, lexical elements and tokens:
Tokens are the primitive components of Sentences. They consist of sequences of alphanumeric characters broken by punctuation or whitespace. Individual punctuation marks become separate tokens, but whitespace does not. Internally, Tokens have the following attributes: an identifier, a part-of-speech tag, the character span relative to the whole document, whether the token is punctuation, and the String associated with the token. There are methods to retrieve each of these pieces of information. Sentence objects have a method, getTokens(), to retrieve their set of Tokens. This set, which is actually a Vector, can be used to iterate over the Tokens of a Sentence.

Tagging

Given the above data structures, the task of tagging a Sentence consists of the steps described below.
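The Token and Sentence structures just described might be sketched as follows. Only getTokens() is named in the notes, so every other field and method name here is an assumption, not the actual MMTx API:

```java
import java.util.Vector;

// Illustrative sketch of the structures described above; names other than
// getTokens() are assumptions rather than the real MMTx classes.
class Token {
    private final int id;            // identifier
    private String tag;              // part-of-speech tag, assigned during tagging
    private final int spanBegin;     // character span relative to the whole document
    private final int spanEnd;
    private final boolean punctuation;
    private final String text;       // the String associated with the token

    Token(int id, int spanBegin, int spanEnd, boolean punctuation, String text) {
        this.id = id;
        this.spanBegin = spanBegin;
        this.spanEnd = spanEnd;
        this.punctuation = punctuation;
        this.text = text;
    }

    int getId() { return id; }
    String getTag() { return tag; }
    void setTag(String tag) { this.tag = tag; }
    int getSpanBegin() { return spanBegin; }
    int getSpanEnd() { return spanEnd; }
    boolean isPunctuation() { return punctuation; }
    String getText() { return text; }
}

class Sentence {
    // The set of Tokens is actually a Vector, which callers iterate over.
    private final Vector<Token> tokens = new Vector<>();

    void addToken(Token t) { tokens.add(t); }

    Vector<Token> getTokens() { return tokens; }
}
```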
We chose to have the tagger client be an instantiated class, which allows us to do some initialization such as opening a socket to a tagger server. (Other implementations could open and initialize an external process.) Our tag(Sentence pSentence) method takes the Sentence object described above, passes its token strings to the tagger, and adds the resulting tags to the Tokens.
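A minimal sketch of such a client is shown below, under two assumptions: the socket round trip is left out (the constructor marks where it would be opened), and the server reply uses the bracketed word/tag format produced by our wrapper, shown later in these notes. All names are illustrative, not the actual MMTx API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative tagger-client sketch; the real client would talk to the
// tagger server over the socket opened in the constructor.
class TaggerClient {
    // Matches one ['word', 'tag'] pair in the wrapper's reply.
    private static final Pattern PAIR =
        Pattern.compile("\\['([^']*)',\\s*'([^']*)'\\]");

    TaggerClient() {
        // Initialization happens once per client instance; this is where a
        // socket to the tagger server would be opened.
    }

    // Parses a server reply and returns one tag per token string, verifying
    // that the word strings in the reply stay aligned with the tokens.
    List<String> tag(List<String> tokenStrings, String serverReply) {
        List<String> tags = new ArrayList<>();
        Matcher m = PAIR.matcher(serverReply);
        int i = 0;
        while (m.find()) {
            if (i >= tokenStrings.size() || !m.group(1).equals(tokenStrings.get(i))) {
                throw new IllegalStateException("tagger output not aligned with tokens");
            }
            tags.add(m.group(2));
            i++;
        }
        if (i != tokenStrings.size()) {
            throw new IllegalStateException("tagger returned too few tags");
        }
        return tags;
    }
}
```

Abstracting the transport this way keeps the alignment check testable without a live tagger server; a socket-backed or process-backed client would only change where the reply string comes from.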
We used the Xerox tagger with our SPECIALIST lexicon (trivially transformed to the appropriate format) and trained the tagger on a set of MEDLINE citations. We also wrote a simple wrapper for the tagger that produces output of the form: [ ['this', 'pron'], ['is', 'aux'], ['a', 'det'], ['test', 'noun'], ['.', 'pd'] ]. The tags are simply added to each Token of the Sentence, making sure that the token strings remain aligned throughout the process. It is useful to note here that tagging is done at the most primitive level of tokenization. Primitive tokens make up larger units, the lexicalElements mentioned above, which are created by matching (possibly multi-word) entries in the SPECIALIST lexicon. lexicalElements inherit tags from their Tokens according to the following convention: if a lexicalElement has only one Token, use its tag; if it has multiple Tokens, use the tag from the first lexical entry, if any; otherwise, use the tag from the rightmost Token.

Considerations for the Brill Tagger

If the Brill tagger were to be employed in MMTx, some issues would need to be resolved. Besides some straightforward formatting considerations, the most important issues arise from tag set differences. There is no one-to-one mapping between the Penn Treebank tags used by the Brill tagger and the MMTx tags used by our parser. (The MMTx tags are based on the tag set of our SPECIALIST lexicon.) In particular, the Penn Treebank tags IN (Preposition or subordinating conjunction), TO (to) and PDT (Predeterminer) map to more than one SPECIALIST tag; and the tags EX (Existential there) and POS (Possessive ending) convey more information than their SPECIALIST counterparts. Finally, the Penn Treebank tags FW (Foreign word), LS (List marker), SYM (Symbol) and UH (Interjection) have no SPECIALIST analogs at all. Note, however, that this last class of differences causes no difficulty because these tags are not relevant to MMTx processing anyway.
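The lexicalElement tag-inheritance convention described earlier can be written out as a small helper. The names are illustrative; lexicalEntryTag stands for the tag of the matched SPECIALIST lexical entry, or null when there is none:

```java
import java.util.List;

// Illustrative helper for the lexicalElement tag-inheritance convention;
// not the actual MMTx implementation.
class LexicalElementTags {
    static String inheritTag(List<String> tokenTags, String lexicalEntryTag) {
        if (tokenTags.size() == 1) {
            return tokenTags.get(0);                 // single Token: use its tag
        }
        if (lexicalEntryTag != null) {
            return lexicalEntryTag;                  // multiple Tokens: tag of the first lexical entry, if any
        }
        return tokenTags.get(tokenTags.size() - 1);  // otherwise: tag of the rightmost Token
    }
}
```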
A first attempt at mapping Penn Treebank tags to our tags is shown in the table at the end of these notes. In this approximation, IN and TO are mapped to PREP, PDT to DETERMINER, EX to ADVERB, POS to NOUN, and FW, LS, SYM and UH to UNKNOWN. In each case, the mapping fails to completely represent the Penn Treebank tag's meaning. To improve the fidelity of the mapping, the SPECIALIST tag set could be modified, especially for the EX and POS tags. This would, of course, entail corresponding modifications to the MMTx parser. Another solution would be to modify the Brill tagger to use the SPECIALIST tag set. Note that the Fast Transformation-Based Learning (TBL) Toolkit developed by the NLP group at Johns Hopkins University provides exactly what is needed here: an efficient implementation of the Brill tagger that accommodates arbitrary tag sets. However, this approach would require a corpus tagged with the SPECIALIST tag set for training the tagger. Such a corpus could be constructed by transforming the Penn Treebank, or another corpus tagged with the Xerox tagger could form the basis of a bootstrapping approach to the problem.

Mapping from the Penn Treebank Tags to MMTx/SPECIALIST Tags
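The problematic part of this first-pass mapping can be written out as a lookup table. Only the Penn Treebank tags discussed in the prose are included here; the full table appears at the end of the notes:

```java
import java.util.Map;

// The first-approximation mapping described in the text, restricted to the
// Penn Treebank tags called out as problematic.
class PennToSpecialist {
    static final Map<String, String> MAP = Map.of(
        "IN",  "PREP",        // preposition or subordinating conjunction
        "TO",  "PREP",        // to
        "PDT", "DETERMINER",  // predeterminer
        "EX",  "ADVERB",      // existential there
        "POS", "NOUN",        // possessive ending
        "FW",  "UNKNOWN",     // foreign word
        "LS",  "UNKNOWN",     // list marker
        "SYM", "UNKNOWN",     // symbol
        "UH",  "UNKNOWN"      // interjection
    );
}
```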
Last Modified: March 30, 2007
Lister Hill National Center for Biomedical Communications | U.S. National Library of Medicine | National Institutes of Health
Department of Health and Human Services