The dTagger is a part-of-speech (POS) tagger. A POS tagger assigns an unambiguous part of
speech, such as noun, adjective, or adverb, to the words or terms within a text. Such tags are a
necessary input for determining phrase boundaries and assigning heads, as is commonly done within
noun phrase extractors. Taggers in general, and this tagger in particular, can tag only after some
training. The sources to train from include text in which the parts of speech have already been
assigned (an annotated corpus), a lexicon of words and their potential parts of speech, and,
optionally, a large amount of plain text from the genre you plan to use the tagger on. The dTagger
is distributed with a model trained on the MedPost Corpus, a corpus of Medline abstracts in the
genomics field, hand-annotated with parts of speech. This corpus is also being redistributed. The
dTagger includes the SPECIALIST Lexicon as well.
Figure 1: An Abstract Highlighted with Parts of Speech
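To make the idea of assigning one unambiguous tag per word concrete, here is a minimal sketch of
hidden Markov model tagging with the Viterbi algorithm, written for a modern Java (8 or later). It
is not the dTagger's own code: the class name ToyHmmTagger, the three-tag set, the three-word
lexicon, and every probability below are made-up illustration values.

```java
import java.util.HashMap;
import java.util.Map;

// Toy HMM tagger. Tags, words, and probabilities are invented for illustration;
// they are NOT taken from the dTagger model or the SPECIALIST Lexicon.
class ToyHmmTagger {
    static final String[] TAGS = {"det", "noun", "verb"};

    // START[j] = P(first tag is j)
    static final double[] START = {0.6, 0.3, 0.1};

    // TRANS[i][j] = P(tag j | previous tag i)
    static final double[][] TRANS = {
        {0.05, 0.90, 0.05},  // after det
        {0.10, 0.30, 0.60},  // after noun
        {0.50, 0.40, 0.10},  // after verb
    };

    // EMIT.get(word)[j] = P(word | tag j); this map doubles as a tiny lexicon.
    static final Map<String, double[]> EMIT = new HashMap<>();
    static {
        EMIT.put("the",   new double[] {0.9, 0.0, 0.0});
        EMIT.put("cells", new double[] {0.0, 0.5, 0.1});
        EMIT.put("stain", new double[] {0.0, 0.3, 0.4});  // ambiguous: noun or verb
    }
    // Words missing from the toy lexicon are treated as equally likely under every tag.
    static final double[] UNKNOWN = {0.01, 0.01, 0.01};

    // Returns the most probable tag sequence for the words (Viterbi, in log space).
    static String[] tag(String[] words) {
        int n = words.length, k = TAGS.length;
        if (n == 0) return new String[0];
        double[][] score = new double[n][k];  // best log-probability ending in tag j at position t
        int[][] back = new int[n][k];         // previous tag on that best path

        double[] first = EMIT.getOrDefault(words[0], UNKNOWN);
        for (int j = 0; j < k; j++) {
            score[0][j] = Math.log(START[j]) + Math.log(first[j]);
        }
        for (int t = 1; t < n; t++) {
            double[] emit = EMIT.getOrDefault(words[t], UNKNOWN);
            for (int j = 0; j < k; j++) {
                int best = 0;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < k; i++) {
                    double s = score[t - 1][i] + Math.log(TRANS[i][j]);
                    if (s > bestScore) { bestScore = s; best = i; }
                }
                score[t][j] = bestScore + Math.log(emit[j]);
                back[t][j] = best;
            }
        }
        // Pick the best final tag, then follow the back-pointers to the start.
        int last = 0;
        for (int j = 1; j < k; j++) {
            if (score[n - 1][j] > score[n - 1][last]) last = j;
        }
        String[] result = new String[n];
        for (int t = n - 1; t >= 0; t--) {
            result[t] = TAGS[last];
            last = back[t][last];
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", tag(new String[] {"the", "stain"})));
        System.out.println(String.join(" ", tag(new String[] {"cells", "stain"})));
    }
}
```

With these toy numbers, "stain" comes out as a noun in "the stain" and as a verb in "cells stain",
which is the kind of context-dependent disambiguation a trained tagger performs.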
|
|
Even though several POS taggers are publicly available, we had needs that motivated us to
write our own. We wanted a tagger that worked specifically with the SPECIALIST Lexicon and
natively used the tag set that is used within the SPECIALIST Lexicon. At the same time, we wanted
the tag set not to be hard-coded, so that other tag sets could be used. We wanted a tagger that
included the trainer and could be trained on untagged text. We wanted a tagger that tokenized text
into single words but, more importantly, could also tokenize text into multi-word terms, the same
granularity as that of the SPECIALIST Lexicon. The SPECIALIST TextTools already include this kind
of tokenization. We would also like this tagger to be flexible enough to be tuned to different
languages.
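As an illustration of what tokenizing into multi-word terms can mean, the following sketch does a
greedy longest-match lookup against a small term list. It is only an assumption about the general
technique, not the SPECIALIST TextTools tokenizer: the class name ToyTermTokenizer, the two sample
terms, and the three-word limit are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Greedy longest-match tokenizer over a hypothetical mini-lexicon of multi-word terms.
class ToyTermTokenizer {
    // Sample terms invented for this example; a real lexicon would be far larger.
    static final Set<String> TERMS =
        new HashSet<>(Arrays.asList("in vitro", "squamous cell carcinoma"));
    static final int MAX_TERM_WORDS = 3;  // longest term, in words, that is tried

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return tokens;
        String[] words = trimmed.toLowerCase().split("\\s+");
        int i = 0;
        while (i < words.length) {
            int taken = 1;  // fall back to a single word when no term matches
            // Try the longest candidate first, then shorter ones.
            for (int len = Math.min(MAX_TERM_WORDS, words.length - i); len > 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                if (TERMS.contains(candidate)) { taken = len; break; }
            }
            tokens.add(String.join(" ", Arrays.copyOfRange(words, i, i + taken)));
            i += taken;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Squamous cell carcinoma cells grown in vitro"));
    }
}
```

Each token that comes out of such a tokenizer, whether a single word or a multi-word term, would
then receive one tag, which is the granularity the paragraph above describes.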
|
|
Download the package from here:
- Prerequisites
  - Java 1.4.2.2 or greater
  - 2 GB of hard disk space
  - 300 MB of memory (more is better)
- Installation Instructions
  - Un-jar dtaggerDist.jar into the location where you want to install the nls
    projects. When unarchived, an nls/dtagger directory will exist.
    > jar xvf dtaggerDist.jar
  - Change directories to the nls/dtagger directory
    > cd nls/dtagger
  - Invoke install.[bat|sh]
    > install.bat
  The install will create the following scripts in the nls/dtagger/bin directory. These are the
  scripts that launch each of the applications:
  - trainWithTaggedText[.bat]
  - tag[.bat]
  - updateWithUntagged[.bat]
  - trainWithUntaggedText[.bat]
  - DiscoverMorphology[.bat]
  - LRAGR2lex[.bat]
  - VerbsToAdjs
  - MakeLexicalLookupIndexes
- Optional Post Installation Actions
  - Add the nls/dtagger/bin directory to the $PATH environment variable. This will
    enable these programs to be run from any directory.
  - Add nls/dtagger/lib/dtaggerProject.jar and nls/dtagger/config to the $CLASSPATH
    environment variable. This will enable applications that have these tools
    embedded in them to find the classes and data.
|
|
Smith L, Rindflesch T, Wilbur WJ. MedPost: a part-of-speech tagger for bioMedical text.
Bioinformatics. 2004 Sep 22;20(14):2320-1.
Manning CD, Schütze H. Foundations of Statistical Natural Language Processing. Massachusetts
Institute of Technology, 2003. Chapter 10.
Cutting D, Kupiec J, Pedersen J, Sibun P. A Practical Part-of-Speech Tagger. Proceedings of the
Third Conference on Applied Natural Language Processing, 1992.
|
|
The dTagger project includes a tagger and three kinds of training:
- A tagger that uses a model created by some prior training,
- Training when you have annotated text,
- Updating a previously trained model using untagged text,
- Training when all you have is untagged text.
There are some additional components:
- A tag set
- A tool to convert the SPECIALIST Lexicon's LRAGR table to a .lex file
All the tagger components assume a lexicon filled with tags. When training, it is best
to build a model using an annotated set of documents, or corpus; the more annotated text, the
better. We recognize that building an annotated corpus is a large task in and of itself. If there
is not an abundance of tagged text to train on, it is fruitful to create an initial model with a
small amount of tagged text, then update the model by running the update with a lot of untagged
text. If it is impossible to find even a few annotated sentences to train on, the ability to train
using just untagged text exists.
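The first of these options, training on annotated text, can be sketched as plain counting. The
example below estimates tag-transition and word-emission probabilities by relative frequency, which
is the textbook approach for hidden Markov model taggers; whether the dTagger trainer works exactly
this way is not stated here. The class name ToyHmmTrainer, the word/tag input format, and the
sample sentences are assumptions made for the example, and the update step that uses untagged text
is not shown.

```java
import java.util.HashMap;
import java.util.Map;

// Relative-frequency training from annotated text. The "word/tag" format and all
// names here are hypothetical; this is not the dTagger trainer.
class ToyHmmTrainer {
    static final String START = "<s>";  // pseudo-tag marking the start of a sentence

    final Map<String, Integer> transitionCounts = new HashMap<>();  // key: "prevTag tag"
    final Map<String, Integer> contextCounts = new HashMap<>();     // key: prevTag
    final Map<String, Integer> emissionCounts = new HashMap<>();    // key: "tag word"
    final Map<String, Integer> tagCounts = new HashMap<>();         // key: tag

    // Adds one annotated sentence such as "the/det cells/noun stain/verb".
    void addSentence(String annotated) {
        String prev = START;
        for (String token : annotated.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0 || slash == token.length() - 1) continue;  // skip malformed tokens
            String word = token.substring(0, slash).toLowerCase();
            String tag = token.substring(slash + 1);
            transitionCounts.merge(prev + " " + tag, 1, Integer::sum);
            contextCounts.merge(prev, 1, Integer::sum);
            emissionCounts.merge(tag + " " + word, 1, Integer::sum);
            tagCounts.merge(tag, 1, Integer::sum);
            prev = tag;
        }
    }

    // Estimated P(tag | previous tag); 0.0 when the previous tag was never seen as context.
    double transitionProb(String prev, String tag) {
        int context = contextCounts.getOrDefault(prev, 0);
        if (context == 0) return 0.0;
        return transitionCounts.getOrDefault(prev + " " + tag, 0) / (double) context;
    }

    // Estimated P(word | tag); 0.0 when the tag was never seen.
    double emissionProb(String tag, String word) {
        int total = tagCounts.getOrDefault(tag, 0);
        if (total == 0) return 0.0;
        return emissionCounts.getOrDefault(tag + " " + word.toLowerCase(), 0) / (double) total;
    }

    public static void main(String[] args) {
        ToyHmmTrainer trainer = new ToyHmmTrainer();
        trainer.addSentence("the/det cells/noun stain/verb");
        trainer.addSentence("the/det stain/noun fades/verb");
        System.out.println(trainer.transitionProb("det", "noun"));
        System.out.println(trainer.emissionProb("noun", "stain"));
    }
}
```

Unsmoothed counts like these give zero probability to anything unseen in the annotated text, which
is one reason a lexicon of potential tags and a later update on untagged text are useful when
annotated data is scarce.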
|
|
This was a collaborative effort initially involving Destinee Nace, who provided the background
material needed to comprehend tagger technologies.
Russell Loane provided the first two iterations of the hidden Markov model code used, and provides
much-needed support in the way of brainstorming, figuring out anomalies, and the like.
Allen Browne has provided the linguistic expertise needed.
Guy Divita took Russell's ideas and code, merged them into a form that is compatible with the
TextTools, and (hopefully) made improvements and additional contributions to them.
|
|
The dTagger started out as a collaboration with Destinee Nace. It was initially named "The Tagger
Destinee", but this became just too much to write when referring to it, so the name was
shortened to dTagger.
|
|
Version 0.0.1
Released November 11, 2006