# ----------------------------------------------
# SPECIALIST Tagset (T1)
# This is the tagset based on the SPECIALIST
# categories.
#
# See http://lexsrv3.nlm.nih.gov/SPECIALIST/Projects/lexicon/2006/release/LEXICON/DOCS/techrpt.pdf
# for a definition of the principal categories.
#
# The punctuation category tags came primarily
# from the conversion that Larry Smith does to convert
# his tagset to the SPECIALIST tags. [Need a cit here]
#
# The Shape tags come from shapes that the Xerox
# PARC tagger identified, or from shapes that
# textTools identifies or hopes to identify.
#
# A "1" in the Open Class column indicates that this tag
# is an open class. This is useful to know when
# guessing a class: we only want to guess open classes.
# An open class is defined to be a lexical category whose
# membership is typically large and which can easily accept
# new members. In English, this includes the categories of
# noun, verb, and adjective. [A Dictionary of Grammatical
# Terms in Linguistics, R.L. Trask, c. 1993, pg. 195]
#
# We presume that we have a lexicon filled with
# all the closed class words and tokens.
#
# Tags that have a 1 in the Shape column are tags
# that will not be seen identifying a term within
# an official lexicon. Rather, these tags are put
# on terms recognized by shape identifiers, such as
# numbers, URLs ...
#
# This tagger relies heavily upon the END tag. It should
# be present in all tagsets associated with this tagger.
# The END tag is a tag that is implicitly put before
# the beginning and after the end of an utterance (sentence).
#
# The Java code needs to know about numbers and punctuation.
# The num and punct tags should always remain in any tagset,
# as represented here - or alter the TagSet.getNumberTagId()
# and TagSet.getPunctuationTagId() methods to correspond
# to the
#
#-+----------+---------------------------------+-----+-----+-----------
# | POS      |                                 |Open |Shape|Example
# | tag      | Name                            |Class|     |Character
#-+----------+---------------------------------+-----+-----+-----------
end          |END                              |0| |
noun         |noun                             |1| |
adj          |adjective                        |1| |
adv          |adverb                           |1| |
verb         |verb                             |1| |
aux          |auxiliary verb "be", "do"        |0| |
modal        |modal verb "have"                |0| |
to           |infinitive marker to             |0| |
conj         |conjunction                      |0| |
pron         |pronoun                          |0| |
compl        |complementizer(that)             |0| |
det          |determiner                       |0| |
pos          |genitive marker                  |0| |
prep         |preposition                      |0| |
num          |number or numeric                |0|1|
real         |real number                      |0|1|
unknown      |unknown                          |0|1|
punct        |punctuation                      |0|1|
pd           |end of sentence period           |0| |.
cm           |comma                            |0| |,
hy           |hyphen                           |0| |-
cl           |colon                            |0| |:
;            |semiColon                        |0| |;
ap           |right quote or double quote      |0| |'"
bq           |left quote (backquote)           |0| |`
lp           |left parenthesis                 |0| |(
rp           |right parenthesis                |0| |)
~            |tilde                            |0| |~
!            |bang                             |0| |!
@            |at sign                          |0| |@
pound        |pound sign                       |0| |#
$            |dollar sign                      |0| |$
%            |percent sign                     |0| |%
^            |caret sign                       |0| |^
&            |and sign                         |0| |&
*            |asterisk                         |0| |*
=            |equal sign                       |0| |=
_            |underBar sign                    |0| |_
+            |plus sign                        |0| |+
{            |left curly bracket               |0| |{
}            |right curly bracket              |0| |}
bar          |bar                              |0| ||
[            |left bracket                     |0| |[
]            |right bracket                    |0| |]
\            |backslash                        |0| |\
/            |slash                            |0| |/
<            |lessThan                         |0| |<
>            |greaterThan                      |0| |>
?            |questionMark                     |0| |?
tab          |tab                              |0| |
shape        |shape                            |0|1|
prefix       |prefix                           |0|1|
money        |money                            |0|1|
phone        |phonenumber                      |0|1|
date         |date                             |0|1|
url          |URL                              |0|1|
email        |EMAIL address                    |0|1|
unitOfMeasure|unit of measure                  |0|1|
chem         |chemical                         |0|1|
propername   |proper name                      |0|1|
acronym      |acronym                          |0|1|
localAcronym |local acronym                    |0|1|
percent      |percent number                   |0|1|
fraction     |fraction                         |0|1|
range        |range                            |0|1|
glob         |glob                             |0|1|
equation     |equation                         |0|1|
levelOfSignificance|level of significance            |0|1|
experimentSize|experiment size                  |0|1|
none         |none                             |0|0|