TaggerClient tokenizes text into terms and retrieves tags for them from whichever tagger client
is currently in use. It uses the tokenizer and lexical lookup classes to tokenize incoming
text into documents, sections, sentences, lexical elements, and tokens. The tagger is then
employed to assign a part-of-speech (POS) tag to each of the tokens. The POS assignments of
lexical elements (i.e., terms) are, in general, taken from the tokens that make up each lexical
element. The POS tags of multi-word lexical elements are taken from the SPECIALIST Lexicon if
the lexical element was created from a Lexicon entry. For multi-token lexical elements that were
created from some shape or pattern, the part of speech is assigned from the definition of the
shape, rather than from the tokens' part-of-speech assignments.
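The precedence described above can be sketched in Java roughly as follows. This is only an illustration of the rule, not the MMTx or textTool implementation: the class, enum, field, and method names are hypothetical, and the tag strings used with it are placeholders.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch; none of these names come from the MMTx or textTool code. */
final class PosAssignmentSketch {

    /** How a lexical element came to exist (hypothetical enum). */
    enum Origin { TOKENS, LEXICON_ENTRY, SHAPE }

    /** A token together with the POS tag the tagger assigned to it. */
    static final class Token {
        final String text;
        final String posTag;

        Token(String text, String posTag) {
            this.text = text;
            this.posTag = posTag;
        }
    }

    /** A lexical element (term) built from one or more tokens. */
    static final class LexicalElement {
        final List<Token> tokens;
        final Origin origin;
        /** POS from the Lexicon entry or the shape definition; null when origin is TOKENS. */
        final String sourcePos;

        LexicalElement(List<Token> tokens, Origin origin, String sourcePos) {
            this.tokens = tokens;
            this.origin = origin;
            this.sourcePos = sourcePos;
        }
    }

    /** Returns the POS tag(s) for a lexical element, following the precedence in the text. */
    static List<String> posTagsFor(LexicalElement element) {
        boolean multiToken = element.tokens.size() > 1;

        // Multi-word element created from a Lexicon entry: use the Lexicon's POS.
        if (multiToken && element.origin == Origin.LEXICON_ENTRY && element.sourcePos != null) {
            return Collections.singletonList(element.sourcePos);
        }
        // Multi-token element created from a shape or pattern: use the shape's POS.
        if (multiToken && element.origin == Origin.SHAPE && element.sourcePos != null) {
            return Collections.singletonList(element.sourcePos);
        }
        // General case: take the tags from the tokens that make up the element.
        List<String> tags = new ArrayList<>();
        for (Token token : element.tokens) {
            tags.add(token.posTag);
        }
        return tags;
    }
}
```

The sketch returns a list so that the general case can pass the token tags through unchanged; how the real code combines several token tags into one assignment is not stated here, so no such rule is assumed.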
The tagger client is specified in the configuration file (MMTxRegistry.cfg for MMTx users, or
NLPRegistry.cfg for textTool users). The client currently employed is the MedPostSKR Tagger,
a Java implementation of the trained component of Larry Smith's MedPost POS Tagger [1].
The tagger client expects input either from an input file or from standard input. If the input
comes from standard input, the tagger client waits for two end-of-line markers before it starts
to process the input.
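A minimal Java sketch of this kind of input handling is shown below. It is not the tagger client's own code: the class and method names are hypothetical, and it assumes that the two end-of-line markers are consecutive, so that a blank line ends a block of input. Skipping leading blank lines and returning a final unterminated block at end of input are further choices made only for this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch; not taken from the tagger client. */
final class StdinBlockReaderSketch {

    /**
     * Reads lines until a blank line (two consecutive end-of-line markers) is seen,
     * and returns the text collected so far. Returns null when the input is exhausted
     * and nothing was collected.
     */
    static String readBlock(BufferedReader in) {
        StringBuilder block = new StringBuilder();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    if (block.length() > 0) {
                        break;      // blank line after some text: the block is complete
                    }
                    continue;       // assumption: ignore blank lines before any text
                }
                if (block.length() > 0) {
                    block.append('\n');
                }
                block.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return block.length() > 0 ? block.toString() : null;
    }
}
```

To read from standard input, the reader would be built with `new BufferedReader(new InputStreamReader(System.in))` and `readBlock` called in a loop until it returns null.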
------------------------------------------------------------------------------------------------------------------------------------
[1] Larry H. Smith, Thomas Rindflesch, and W. John Wilbur. 2004. MedPost: a part-of-speech
tagger for biomedical text. Bioinformatics, 20(14):2320–2321.