3. Part-Of-Speech Tagger Client
Introduction
TaggerClient tokenizes text into terms and retrieves part-of-speech tags from the tagger currently configured. It uses the tokenizer and lexical lookup classes to break incoming text into documents, sections, sentences, lexical elements, and tokens. The tagger then assigns a part-of-speech (POS) tag to each token. The POS assignments of lexical elements (i.e., terms) are, in general, taken from the tokens that make up the lexical element. The POS tags of multi-word lexical elements are taken from the SPECIALIST Lexicon if the lexical element was created from a Lexicon entry. For multi-token lexical elements that were created from a shape or pattern, the part of speech is assigned from the definition of the shape rather than from the tokens' POS assignments.
The tagger client is specified in the configuration file (MMTxRegistry.cfg for MMTx users, or NLPRegistry.cfg for textTool users). The client currently employed is the MedPostSKR Tagger, a Java implementation of the trained component of Larry Smith's MedPost POS tagger [1].
The tagger client expects input either from an input file or from standard input. If the input comes from standard input, the tagger client waits for two end-of-line markers before it starts to process the input.
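The standard-input behavior can be sketched as follows. This is a minimal illustration rather than part of the distribution: the helper names terminate_for_stdin and run_tagger are hypothetical, and the sketch assumes that the two end-of-line markers are two consecutive newline characters and that the executable is reachable on the PATH as taggerClient.

```python
import subprocess


def terminate_for_stdin(text: str) -> str:
    """Return text that ends with exactly two newline characters.

    Assumption: the two end-of-line markers the client waits for are two
    consecutive newline characters, i.e. a blank line after the text.
    """
    return text.rstrip("\r\n") + "\n\n"


def run_tagger(text: str, executable: str = "taggerClient") -> str:
    """Pipe text to the tagger client and return whatever it prints.

    Assumption: the executable name matches the Usage section and is on the
    PATH. No options are passed, so standard input and output are used.
    """
    completed = subprocess.run(
        [executable],
        input=terminate_for_stdin(text),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Calling run_tagger("The doctor ran over the tests.") would then send the sentence followed by a blank line, which is the signal the client needs before it begins processing.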
------------------------------------------------------------------------------------------------------------------------------------
[1] Larry H. Smith, Thomas Rindflesch, and W. John Wilbur. 2004. MedPost: a part-of-speech tagger for biomedical text. Bioinformatics, 20(14):2320–2321.
Usage
     taggerClient[.bat]  [options]
   Where the options can be listed in any order.
Global Options

Long Name    Short Name    Description
--help       -h            Show the help
Input/Output Options

Long Name               Description
--fileName=             Full path name of the input file to process. Standard input
                        is used if not specified.
--outputFileName=       Full path of the output file. Standard output is used if
                        not specified.
--inputType=            Determines the format of the input file. This parameter can
                        take the following values:
                          medlineCitations  MEDLINE's format for an abstract
                          fieldedText       Each newline-delimited row is considered
                                            a separate document
                          mrcon             The eight-column format of the UMLS file
                                            MRCON
                          freeText          The input is free text
                          autoDetect        [Default]
--textField=n           For fielded text, the field that contains the text. The
                        default textField is 2.
--fieldSeparator="|"    For fielded text, the character used as the separator. The
                        default is a pipe (|).
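How --textField and --fieldSeparator select the text of a fielded row can be sketched as below. This is only an illustration: the function names are hypothetical, and the sketch assumes that fields are numbered from 1 (the option description does not say) and that blank rows are skipped.

```python
from typing import Iterable, Iterator


def extract_text_field(row: str, text_field: int = 2, separator: str = "|") -> str:
    """Return the text column of one fielded-text row.

    Assumption: fields are counted from 1, so the default of 2 selects the
    second column of the row.
    """
    fields = row.rstrip("\r\n").split(separator)
    if not 1 <= text_field <= len(fields):
        raise ValueError(
            f"row has {len(fields)} fields, cannot select field {text_field}"
        )
    return fields[text_field - 1]


def documents_from_rows(
    rows: Iterable[str], text_field: int = 2, separator: str = "|"
) -> Iterator[str]:
    """Yield one document text per non-blank row, as fieldedText input implies.

    Assumption: rows that contain only whitespace are ignored.
    """
    for row in rows:
        if row.strip():
            yield extract_text_field(row, text_field, separator)
```

With the defaults, the row "id1|The doctor ran over the tests.|extra" yields the document text "The doctor ran over the tests.".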
Data Options

Long Name                           Description
--NLP_ROOT=somePath                 The root path to the TextTools.
--ambiguousAcronyms                 Disambiguate sentence boundaries using the acronyms and
                                    abbreviations file. This is turned off by default.
--configName=nlp.cfg                The name of an overriding configuration file. Command-line
                                    settings can be put into this file to override the default
                                    settings for a particular run.
--ambiguousAcronymsFile=fileName    Path of the acronyms and abbreviations file that the
                                    tokenizer needs to determine sentence breaks.
Display Options
Long Name  Short Name
     Description

--documents
     Display document information. Off by default. This prints out the document id, which is useful when processing files that contain many MEDLINE citations.

--sections
     Display section information. Off by default. This prints out the section id and section label for each section. For example:
          Section: 1|MEDLINE Unique Identifier|
          Section: 2|Date Created|
          Section: 3|ISSN|

--sentences
     Display sentence information. Off by default. This prints out each sentence, one per line. When combined with the --pipedOutput option, this option prints out the sentence id, the begin and end character offsets relative to the section, and the sentence. For example:
          Sentence|11|358|422|Red wines contain a variety of polyphenolic antioxidants.

--phrases
     Display the phrases. When the --pipedOutput flag is turned on, the output fields for this element include:
          Phrase Tag|Phrase Number|Begin Char Offset|End Char Offset|phrase|reduced phrase|Number of phrase tokens|has Head
     The reduced phrase is the phrase minus determiners, prepositions, verbs, auxiliaries, modals, and punctuation. The has Head field is a boolean that is currently set only during MMTx processing.

--lexicalElements
     Display the lexical elements. When the --pipedOutput flag is turned on, the output fields for this element include:
          Lexical Element Tag|Element Number|Type|Category|Term|Begin Char Offset|End Char Offset
     The Type is either Lexicon, Shape, or Punctuation.

--lexicalEntries
     Display the entries from the SPECIALIST Lexicon for this term. When the --pipedOutput flag is turned on, the output fields for this element include:
          Lexical Entry Tag|Term|Category|Entry Unique Identifier

--tokens
     Display tokens. Off by default. This option prints out each token, one per line. When combined with the --pipedOutput option, this option prints out the token id, the begin and end character offsets (relative to the document), the token number relative to the beginning of the sentence, and the token. For example:
          Token|236|1334|1337|0|-1|Thus|||

--pipedOutput
     Display in a pipe-delimited format. Off by default.

--indicate_citation_end  -E
     Indicate citation end. Off by default.
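The piped Sentence record shown for --sentences can be read back with a few lines of code. The sketch below is an illustration only: SentenceRecord and parse_sentence_line are hypothetical names, and the field order is taken from the --sentences description (sentence id, begin offset, end offset, sentence).

```python
from typing import NamedTuple


class SentenceRecord(NamedTuple):
    sentence_id: int
    begin: int   # begin character offset, relative to the section
    end: int     # end character offset, relative to the section
    text: str


def parse_sentence_line(line: str) -> SentenceRecord:
    """Parse one 'Sentence|id|begin|end|text' line of piped output."""
    # Split at most four times so that a pipe inside the sentence text
    # stays part of the text.
    parts = line.rstrip("\r\n").split("|", 4)
    if len(parts) != 5 or parts[0] != "Sentence":
        raise ValueError(f"not a piped Sentence record: {line!r}")
    _, sentence_id, begin, end, text = parts
    return SentenceRecord(int(sentence_id), int(begin), int(end), text)
```

Applied to the example record above, this yields sentence id 11 with offsets 358 and 422.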
Lookup Options
Long Name                  Description
--shortestSpanningMatch    Use the shortest spanning match algorithm.
--longestSpanningMatch     Use the longest spanning match algorithm. This is the default.
--KSYear=2003              Specify the Knowledge Source year or a custom label to open that
                           version's Lexicon. The default is 2004. See "indexing a custom
                           built lexicon" for building and using a custom-built Lexicon.
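This page does not define the two spanning match algorithms. One common reading, offered here purely as an assumption, is a greedy left-to-right lookup that at each position prefers either the longest or the shortest run of tokens found in the lexicon. The sketch below illustrates that reading only; span_match, its parameters, and the toy lexicon are hypothetical and are not taken from the actual lookup classes.

```python
from typing import List, Set


def span_match(
    tokens: List[str], lexicon: Set[str], longest: bool = True, max_span: int = 5
) -> List[str]:
    """Greedily group tokens into lexicon terms, scanning left to right.

    Assumption: lexicon entries are lowercase, space-joined token strings.
    Tokens that match no entry are emitted on their own.
    """
    elements = []
    i = 0
    while i < len(tokens):
        lengths = list(range(1, min(max_span, len(tokens) - i) + 1))
        if longest:
            lengths.reverse()  # try the widest span first
        for n in lengths:
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in lexicon:
                elements.append(candidate)
                i += n
                break
        else:
            # No entry starts at this token: emit it as a single element.
            elements.append(tokens[i])
            i += 1
    return elements
```

Under this reading, the tokens "heart attack risk" with a lexicon containing "heart", "attack", "heart attack", and "risk" give two elements ("heart attack", "risk") for the longest match and three single-token elements for the shortest match.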
Tagger Options

Long Name               Description
--tagger_output         Display the tags.
--tagger=medpostskr     This is currently the only option. This variable is used to
                        switch to other tagger client implementations within the
                        taggerservices factory class when they become available.
                        Within NLM, an internal server ran a locally trained Xerox
                        PARC tagger, and a socket-based client was built as a second
                        implementation.
--taggerMachineName=    The name of the socket-based server that this client could
                        query. [Deprecated]
--taggerPortNumber=     The tagger server port number. [Deprecated]
--useTagger             Toggles whether to use the tagger. By default, this is
                        turned on.
--tag_text              Toggles whether to use the tagger. By default, this is
--dontUseTagger         turned off. This variable exists as a synonym of MetaMap's
                        --tag_text variable, which turns off the tagger when
                        specified. [Deprecated]
Example
Each Lexical Element line in the output below contains the following pipe-delimited fields, in order:
     Tag
     Element Number
     Pedigree of Term
     Part of Speech
     Lexical Element
     Begin Offset
     End Offset
     Irrelevant Boolean Field


> taggerClient
The doctor ran over the tests.
Sentence|0|0|31|The doctor ran over the tests.
Lexical Element|0|LEXICON|det|The|0|2|false
Lexical Element|1|LEXICON|noun|doctor|4|9|false
Lexical Element|2|LEXICON|verb|ran|11|13|false
Lexical Element|3|LEXICON|prep|over|15|18|false
Lexical Element|4|LEXICON|det|the|20|22|false
Lexical Element|5|LEXICON|noun|tests|24|28|false
Lexical Element|6|PUNCTUATION|punctuation|.|29|29|false
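Output in this form is easy to post-process. The following sketch, with hypothetical names (LexicalElement, parse_lexical_element, pos_tags), reads the Lexical Element lines of the example using the eight fields listed above and returns (term, part of speech) pairs. It is an illustration under those assumptions, not code shipped with the tool.

```python
from typing import Iterable, List, NamedTuple, Tuple


class LexicalElement(NamedTuple):
    number: int
    pedigree: str   # LEXICON or PUNCTUATION in the example output
    pos: str        # part of speech
    term: str
    begin: int
    end: int
    flag: bool      # the trailing boolean field


def parse_lexical_element(line: str) -> LexicalElement:
    """Parse one 'Lexical Element|...' line of the example output."""
    head = line.rstrip("\r\n").split("|", 4)
    if len(head) != 5 or head[0] != "Lexical Element":
        raise ValueError(f"not a Lexical Element record: {line!r}")
    _, number, pedigree, pos, rest = head
    # Split the tail from the right so a term that is itself a pipe
    # character does not shift the offset fields.
    term, begin, end, flag = rest.rsplit("|", 3)
    return LexicalElement(
        int(number), pedigree, pos, term, int(begin), int(end), flag == "true"
    )


def pos_tags(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (term, part of speech) pairs, skipping other record types."""
    return [
        (element.term, element.pos)
        for element in map(parse_lexical_element,
                           (l for l in lines if l.startswith("Lexical Element|")))
    ]
```

For the example above, pos_tags returns the pairs (The, det), (doctor, noun), (ran, verb), (over, prep), (the, det), (tests, noun), and (., punctuation).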