SPECIALIST dTagger

1. Introduction

The dTagger is a Part of Speech (POS) tagger. A POS tagger assigns an unambiguous part of speech, such as noun, adjective, or adverb, to the words or terms within a text. Such tags are a necessary component for determining phrase boundaries and head assignment, as is commonly done within noun phrase extractors. Taggers in general, and this tagger in particular, are able to tag after some training. The sources to train from include text in which the parts of speech have already been assigned (an annotated corpus), a lexicon of words and their potential parts of speech, and, optionally, a large amount of plain text from the genre on which you plan to use the tagger. The dTagger is distributed with a model that was trained on the MedPost Corpus, a corpus of Medline abstracts in the genomics field hand annotated with parts of speech. This corpus is also being redistributed. The dTagger includes the SPECIALIST Lexicon as well.

Figure 1: An Abstract Highlighted with Parts of Speech

2. Motivation

Even though there are several publicly available POS taggers, we had needs that motivated us to write our own. We wanted a tagger that worked specifically with the SPECIALIST Lexicon and natively used the tag set of the SPECIALIST Lexicon. That being said, we wanted a tagger in which the tag set was not hard coded, so that other tag sets could be used. We wanted a tagger that included the trainer and could be trained on untagged text. We wanted a tagger that tokenized text into single words but, more importantly, could tokenize text into multi-word terms, the same granularity as that of the SPECIALIST Lexicon. The SPECIALIST TextTools already include this kind of tokenization. We would also like this tagger to be flexible enough to be tuned to different languages.

3. Download and Install

Download the package from here:

Package Name: dtaggerDist.jar
Size: 240 MB

Prerequisites
- Java 1.4.2.2 or greater
- 2 GB of hard disk space
- 300 MB of memory (more is better)
Installation Instructions
- Un-jar the dtaggerDist.jar into the location where you want to install the nls projects. When un-archived, an nls/dtagger directory will exist.
  > jar xvf dtaggerDist.jar
- Change directories to the nls/dtagger directory
  > cd nls/dtagger
- Invoke the install.[bat|sh]
  > install.bat

The install will create the following scripts in the nls/dtagger/bin directory. These are the scripts to kick off each of the applications:

trainWithTaggedText[.bat]

tag[.bat]

updateWithUntagged[.bat]

trainWithUntaggedText[.bat]

DiscoverMorphology[.bat]

LRAGR2lex[.bat]

VerbsToAdjs

MakeLexicalLookupIndexes

Optional Post Installation Actions
- Add the nls/dtagger/bin directory to the $PATH environment variable. This will enable these programs to be run from any directory.
- Add nls/dtagger/lib/dtaggerProject.jar and nls/dtagger/config to the $CLASSPATH environment variable. This will enable applications that have these tools embedded in them to find the classes and data.
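
On a Unix-like shell, the two optional actions above might look like the following minimal sketch. NLS_ROOT and DTAGGER_HOME are hypothetical variable names introduced here for illustration; the installer does not define them. NLS_ROOT stands for the directory in which dtaggerDist.jar was un-jarred. The sketch assumes the Unix classpath separator ":"; on Windows the separator is ";".

```shell
# Hypothetical variable: the directory where dtaggerDist.jar was un-jarred.
# Defaults to $HOME here only for illustration; set it to your own location.
NLS_ROOT="${NLS_ROOT:-$HOME}"
DTAGGER_HOME="$NLS_ROOT/nls/dtagger"

# Make the scripts in nls/dtagger/bin runnable from any directory.
export PATH="$PATH:$DTAGGER_HOME/bin"

# Let applications that embed the tools find the classes and data.
# ${CLASSPATH:+$CLASSPATH:} keeps an existing CLASSPATH value, if one is set.
export CLASSPATH="${CLASSPATH:+$CLASSPATH:}$DTAGGER_HOME/lib/dtaggerProject.jar:$DTAGGER_HOME/config"
```

Placing these lines in a shell startup file would make the settings persist across sessions.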

4. References

Divita G, Loane R, Browne AC. dTagger, a POS tagger. AMIA Fall Symposium, 2006 [in press].

Notes on the Hidden Markov Model used within dTagger: http://SPECIALST.nlm.nih.gov/dtagger/markov.html

The SPECIALIST Text Tools: http://lexsrv3.nlm.nih.gov/SPECIALIST/Projects/Summary/textTools.html

The SPECIALIST Lexicon: http://SPECIALIST.nlm.nih.gov/lexicon

Browne AC, McCray AT, Srinivasan S. The SPECIALIST Lexicon Technical Report, 6/2000, http://lexsrv3.nlm.nih.gov/SPECIALIST/Projects/lexicon/current/release/LEXICON/DOCS/techrpt.pdf

Smith L, Rindflesch T, Wilbur WJ. MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics. 2004 Sep 22;20(14):2320-1.

Manning CD, Schütze H. Foundations of Statistical Natural Language Processing, 2003 Massachusetts Institute of Technology, Chapter 10.

Cutting D, Kupiec J, Pedersen J, Sibun P. A Practical Part-of-Speech Tagger. Proceedings of the Third Conference on Applied Natural Language Processing, 1992.

5. Hidden Markov Model

See related topics and documents

markov.pdf

6. Tagger Components

The dTagger project includes not only a tagger, but also three kinds of training:

Training when you have annotated text,
A tagger that uses a model created by some prior training,
Updating a previously trained model using untagged text,
Training when all you have is untagged text.

There are some additional components:

A tag set
A tool to convert the SPECIALIST Lexicon's LRAGR table to a .lex file

All the tagger components assume a lexicon filled with tags. When training, it is best to build a model using an annotated set of documents, or corpus. The more annotated text, the better. We recognize that building an annotated corpus is a large task in and of itself. If there is not an abundance of tagged text to train on, it is fruitful to create an initial model with a small amount of tagged text, then update the model by running the update with a large amount of untagged text. If it is impossible to find even a few annotated sentences to train on, the ability to train using just untagged text exists.

Train With Annotated Text

See related topics and documents

TrainWithTaggedText

Tagging

See related topics and documents

Tag

Update with Untagged Text

See related topics and documents

Update with Untagged Text

Train with Untagged Text

See related topics and documents

Train with Untagged Text

The SPECIALIST Tagset

See related topics and documents

T1TagSet.txt.html

Utility to convert LRAGR to a .lex file

See related topics and documents

LRAGR2Lex

Utility to create adjs from Verbs

See related topics and documents

VerbsAsAdjs

Morphology Discovery

See related topics and documents

Morphology Discovery

MakeLexicalLookupIndexes

7. Team Members

This was a collaborative effort initially involving Destinee Nace, who provided the background material needed to comprehend tagger technologies.

Russell Loane provided the first two iterations of the hidden Markov Model code used, and provides much needed support in the way of brainstorming, figuring out anomalies and the like.

Allen Browne has provided the linguistic expertise needed.

Guy Divita took Russell's ideas and code, merged it into a form that is compatible with the TextTools, and (hopefully) made improvements and additional contributions to it.

8. About the Name

The dTagger started out as a collaboration with Destinee Nace. It was initially named "The Tagger Destinee", but this became just too much to write when referring to it, and consequently got shortened to dTagger.

9. Version/History

0.0.1

Version 0.0.1

Released November 11, 2006

Contact: umlslex@nlm.nih.gov

Lexical Systems Group, Cognitive Science Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Department of Health & Human Services

Last updated: 11/11/2006