Part-Of-Speech Tagger Client

Introduction

TaggerClient tokenizes text into terms and retrieves part-of-speech tags for them from the configured tagger. It uses the tokenizer and lexical lookup classes to break incoming text into documents, sections, sentences, lexical elements, and tokens. The tagger then assigns a part-of-speech (POS) tag to each token. The POS assignments of lexical elements (i.e., terms) are, in general, taken from the tokens that make up the lexical element. The POS tags of multi-word lexical elements are taken from the SPECIALIST Lexicon if the lexical element was created from a Lexicon entry. For multi-token lexical elements that were created from a shape or pattern, the part of speech is assigned from the definition of the shape rather than from the tokens' part-of-speech assignments.

The tagger client is specified in the configuration file (MMTxRegistry.cfg for MMTx users, or NLPRegistry.cfg for textTool users). The client currently employed is the MedPostSKR Tagger, a Java implementation of the trained component of Larry Smith's MedPost POS tagger [1].

The tagger client expects input either from an input file or from standard input. If the input comes from standard input, the tagger client waits for two end-of-line markers before it starts processing the input.
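The standard-input behaviour can be illustrated with a short Python sketch. It assumes that an end-of-line marker is a newline character, so that a blank line after the text signals that the input is complete. The helper name `terminate_for_stdin` is hypothetical and is not part of the tools.

```python
def terminate_for_stdin(text: str) -> str:
    """Return text that ends in exactly two newlines.

    Assumption: the two end-of-line markers the client waits for
    are two consecutive newline characters.
    """
    # Drop any trailing newlines or carriage returns, then add two newlines.
    return text.rstrip("\r\n") + "\n\n"


payload = terminate_for_stdin("The doctor ran over the tests.")
```

A string prepared this way could then be written to the standard input of taggerClient, for example through Python's `subprocess.run(..., input=payload, text=True)`.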

------------------------------------------------------------------------------------------------------------------------------------

[1] Larry H. Smith, Thomas Rindflesch, and W. John Wilbur. 2004. MedPost: a part-of-speech tagger for biomedical text. Bioinformatics, 20(14):2320–2321.

Usage

taggerClient[.bat] [options]

Options can be listed in any order.

Global Options

Long Name	Short Name	Description
--help	-h	Show the help

Input/Output Options

Long Name	Description
--fileName=	Full path name of the input file to process. If not specified, standard input is used.
--outputFileName=	Full path of the output file. If not specified, standard output is used.
--inputType=	Format of the input file. One of: medlineCitations (MEDLINE's format for an abstract); fieldedText (each newline-delimited row is treated as a separate document); mrcon (the eight-column format of the UMLS file MRCON); freeText (the input is free text); autoDetect (the default).
--textField=n	For fielded text, the field that contains the text. The default textField is 2.
--fieldSeparator="|"	For fielded text, the character that separates the fields. The default is a pipe (|).
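As an illustration of the fielded-text options, the following Python sketch picks the text field out of one row. It assumes that field numbers are counted from 1, which the default of 2 suggests but the description above does not state. The function name and the sample row are made up for this example.

```python
def extract_text_field(row: str, text_field: int = 2,
                       field_separator: str = "|") -> str:
    """Return the text field of one fielded-text row.

    Assumption: fields are numbered from 1, so the default of 2
    selects the second field.
    """
    fields = row.rstrip("\n").split(field_separator)
    if not 1 <= text_field <= len(fields):
        raise ValueError(f"row has {len(fields)} fields, no field {text_field}")
    return fields[text_field - 1]


# Hypothetical row: an identifier followed by the text to process.
row = "doc001|The doctor ran over the tests."
text = extract_text_field(row)
```

Because each newline-delimited row counts as a separate document, a file would be handled by applying the same extraction to every line.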

Data Options

Long Name	Description
--NLP_ROOT=somePath	The root path to the TextTools.
--ambiguousAcronyms	Disambiguate sentence boundaries using the acronyms and abbreviations file. This is turned off by default.
--configName=nlp.cfg	The name of an overriding configuration file. Command line settings can be put into this file to override the default settings for a particular run.
--ambiguousAcronymsFile=fileName	Path of the acronyms and abbreviations file needed in the tokenizer to determine sentence breaks.

Display Options

Long Name	Short Name	Description

--documents

Display Document information. Off by default. This prints out the document id. This is useful when processing files that contain many MEDLINE citations.

--sections

Display Section information. Off by default. This prints out the section id and section labels for each section. For example:

     Section: 1|MEDLINE Unique Identifier|

     Section: 2|Date Created|

     Section: 3|ISSN|

--sentences

Display Sentence information. Off by default. This prints out each sentence, one per line. When combined with the --pipedOutput option, this option prints out the sentence id, the begin and end character offsets relative to the section, and the sentence.

Sentence|11|358|422|Red wines contain a variety of polyphenolic antioxidants.

--phrases

Display the phrases. When the --pipedOutput flag is turned on, the output fields for this element include:

Phrase Tag|Phrase Number|Begin Char Offset|End Char Offset|phrase|reduced phrase|Number of phrase tokens|has Head

The reduced phrase is the phrase minus determiners, prepositions, verbs, auxiliaries, modals, and punctuation.

The has Head field is a boolean that is currently set only during MMTx processing.

--lexicalElements

Display the lexical elements. When the --pipedOutput flag is turned on, the output fields for this element include:

Lexical Element Tag|Element Number|Type|Category|Term|Begin Char Offset|End Char Offset

The Type is either Lexicon, Shape, or Punctuation.

--lexicalEntries

Display the entries from the SPECIALIST Lexicon for this term. When the --pipedOutput flag is turned on, the output fields for this element include:

Lexical Entry Tag|Term|Category|Entry Unique Identifier

--tokens

Display tokens. Off by default. This option prints out each token, one per line. When combined with the --pipedOutput option, it prints out the token id, the begin and end character offsets (relative to the document), the token number relative to the beginning of the sentence, and the token.

Token|236|1334|1337|0|-1|Thus|||

--pipedOutput

Display in a pipe delimited format. Off by default.

--indicate_citation_end

-E

Indicate citation end. Off by default.
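The piped sentence format described under --sentences can be read back with a few lines of Python. This is a minimal sketch based only on the sample line shown above. The names `SentenceRecord` and `parse_sentence_line` are hypothetical.

```python
from typing import NamedTuple


class SentenceRecord(NamedTuple):
    sentence_id: int
    begin: int   # begin character offset, relative to the section
    end: int     # end character offset, relative to the section
    text: str


def parse_sentence_line(line: str) -> SentenceRecord:
    """Parse one 'Sentence|id|begin|end|text' line of piped output."""
    # Split at most four times so that any pipe inside the sentence
    # text stays part of the text.
    tag, sentence_id, begin, end, text = line.rstrip("\n").split("|", 4)
    if tag != "Sentence":
        raise ValueError(f"not a Sentence line: {line!r}")
    return SentenceRecord(int(sentence_id), int(begin), int(end), text)


record = parse_sentence_line(
    "Sentence|11|358|422|Red wines contain a variety of polyphenolic antioxidants."
)
```

The other piped records (phrases, lexical elements, lexical entries) list their fields in the same pipe-delimited way, so a similar split on the pipe character should apply to them, provided the term itself does not contain a pipe.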

Lookup Options

Long Name	Description
--shortestSpanningMatch	Use the shortest spanning match algorithm.
--longestSpanningMatch	Use the longest spanning match algorithm (the default).
--KSYear=2003	Specify the Knowledge Source year or a custom label to open that version's Lexicon. The default is 2004. See the section on indexing a custom built lexicon for how to build and use a custom built Lexicon.

Tagger Options

Long Name	Description
--tagger_output	Display the tags
--tagger=medpostskr	This is currently the only option. This variable is used to switch to other implementations of a tagger client within the taggerservices factory class when they become available. Within NLM, an internal server was run around a locally trained Xerox PARC tagger, and a socket-based client was built as a second implementation.
--taggerMachineName=	The name of the socket based server that this client could query. [Deprecated]
--taggerPortNumber=	The tagger server port number. [Deprecated]
--useTagger	Toggles whether to use the tagger or not. By default, this is turned on.
--tag_text --dontUseTagger	Toggles whether to use the tagger. By default this is turned off. This option exists as a synonym of MetaMap's --tag_text option, which turns off the tagger when specified. [Deprecated]
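Since the options can be given in any order, a command line can be assembled programmatically from the option tables above. The Python sketch below uses only flags listed in this document. The function name, its parameters, and the sample path are hypothetical, and the executable name would be `taggerClient.bat` on Windows.

```python
from typing import List, Optional, Sequence

# Values and flags copied from the option tables in this document.
INPUT_TYPES = ("medlineCitations", "fieldedText", "mrcon", "freeText", "autoDetect")
DISPLAY_FLAGS = ("documents", "sections", "sentences", "phrases",
                 "lexicalElements", "lexicalEntries", "tokens")


def build_tagger_command(file_name: Optional[str] = None,
                         input_type: str = "autoDetect",
                         display: Sequence[str] = ("sentences", "lexicalElements"),
                         piped_output: bool = True,
                         tagger_output: bool = False,
                         executable: str = "taggerClient") -> List[str]:
    """Build an argument list for taggerClient from documented options."""
    if input_type not in INPUT_TYPES:
        raise ValueError(f"unknown inputType: {input_type}")
    command = [executable]
    if file_name is not None:
        command.append(f"--fileName={file_name}")
    command.append(f"--inputType={input_type}")
    for flag in display:
        if flag not in DISPLAY_FLAGS:
            raise ValueError(f"unknown display option: {flag}")
        command.append(f"--{flag}")
    if piped_output:
        command.append("--pipedOutput")
    if tagger_output:
        command.append("--tagger_output")
    return command


command = build_tagger_command("/tmp/in.txt", "freeText", tagger_output=True)
```

Validating the input type and display flags against the documented lists catches misspelled options before the client is started.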

Example

The output for a LexicalElement has the following fields:

Tag|Element Number|Pedigree of Term|Part of Speech|Lexical Element|Begin Offset|End Offset|Irrelevant Boolean Field

> taggerClient

The doctor ran over the tests.

Sentence|0|0|31|The doctor ran over the tests.

Lexical Element|0|LEXICON|det|The|0|2|false

Lexical Element|2|LEXICON|verb|ran|11|13|false

Lexical Element|4|LEXICON|det|the|20|22|false

Lexical Element|6|PUNCTUATION|punctuation|.|29|29|false
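The example output above can be checked with a small Python sketch that parses the Lexical Element lines and compares each term with the span of the sentence that its offsets select. In this example the offsets behave as inclusive character positions, which is an observation from the sample output rather than a documented guarantee. The names `LexicalElement` and `parse_lexical_element` are hypothetical.

```python
from typing import NamedTuple


class LexicalElement(NamedTuple):
    number: int
    pedigree: str
    part_of_speech: str
    term: str
    begin: int
    end: int


def parse_lexical_element(line: str) -> LexicalElement:
    """Parse one eight-field 'Lexical Element|...' line of piped output."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 8 or fields[0] != "Lexical Element":
        raise ValueError(f"not a Lexical Element line: {line!r}")
    # The last field is the irrelevant boolean and is ignored.
    _, number, pedigree, pos, term, begin, end, _ = fields
    return LexicalElement(int(number), pedigree, pos, term, int(begin), int(end))


sentence = "The doctor ran over the tests."
lines = [
    "Lexical Element|0|LEXICON|det|The|0|2|false",
    "Lexical Element|2|LEXICON|verb|ran|11|13|false",
    "Lexical Element|4|LEXICON|det|the|20|22|false",
    "Lexical Element|6|PUNCTUATION|punctuation|.|29|29|false",
]
elements = [parse_lexical_element(line) for line in lines]

# With inclusive offsets, the slice must run to end + 1.
spans_match = all(sentence[e.begin:e.end + 1] == e.term for e in elements)
```

A term that itself contains a pipe character would break this simple split, so a more careful parser would be needed for such input.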

	Contact umlslex@nlm.nih.gov

Lexical Systems Group, Cognitive Science Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Department of Health & Human Services
	Last updated: 11/11/2006