norm), the terms are assigned IDF/TF scores, and each word in the term is broken
      out and assigned an independent IDF/TF score.
5. phrase discovery Yes
      a. what kind of phrase?
             Simplex noun phrases (= all modifiers and the head of the NP but no
             determiners, quantifiers, or post-head-position modifying phrases or
             clauses).
      b. using statistical methods
             No. NPs retained for thesaurus creation are scored using statistically-based
             measures of expected `rarity' (based on component words), distribution,
             frequency, and coverage. But NPs are not identified in texts based on
             statistical parsing, for example.
      c. using syntactic methods
             Yes. NPs are discovered using a parser that implements a `heuristic'
             grammar. In particular, following word-for-word morphological analysis
             (resulting in a set of syntactic-category tags for each word encountered in
             a text), the parser identifies the subsequences that form NPs. Identification
             of NPs is based on rules that perform NP-boundary-condition tests.
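
The boundary-condition approach described in 5c can be illustrated with a minimal sketch. The tag names (DET, ADJ, NOUN, and so on), the two tag sets, and the function names below are hypothetical; the actual CLARIT tag inventory and grammar rules are not given in this description. The sketch only shows the general idea: each word carries a set of candidate tags, a single left-to-right pass collects runs of words that can be modifiers or heads, and any other word (determiner, quantifier, preposition) acts as a phrase boundary.

```python
# Hypothetical tag sets; the real CLARIT tags and rules are not stated in the text.
MODIFIER_TAGS = {"ADJ", "NOUN"}   # categories assumed able to appear inside an NP
HEAD_TAGS = {"NOUN"}              # categories assumed able to head an NP


def _flush(current, phrases):
    """Close an open candidate, trimming trailing words that cannot be a head."""
    while current and not (current[-1][1] & HEAD_TAGS):
        current.pop()
    if current:
        phrases.append(" ".join(word for word, _ in current))


def simplex_noun_phrases(tagged):
    """Single pass over (word, candidate_tag_set) pairs, returning simplex NPs."""
    phrases, current = [], []
    for word, tags in tagged:
        if tags & MODIFIER_TAGS:
            current.append((word, tags))
        else:
            # Boundary condition: this word cannot be part of a simplex NP.
            _flush(current, phrases)
            current = []
    _flush(current, phrases)
    return phrases


tagged_text = [
    ("the", {"DET"}), ("fast", {"ADJ"}), ("retrieval", {"NOUN"}),
    ("system", {"NOUN"}), ("of", {"PREP"}), ("many", {"QUANT"}),
    ("large", {"ADJ"}), ("documents", {"NOUN"}),
]
print(simplex_noun_phrases(tagged_text))
```

In this toy input the determiner, the preposition, and the quantifier each end a candidate phrase, so neither they nor the post-head prepositional phrase are absorbed into the preceding NP.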
6. syntactic parsing Yes (see above). A single-pass parser follows morphological analysis.
7. word sense disambiguation
      No. No attempt is made to control for word senses in morphological or syntactic
      analysis. As noted above, disambiguation of grammatical categories is facilitated by
      restricting possible categories for selective items. In addition, absolute preferences
      are established for grammatical categories appearing in noun phrases.
8. heuristic associations
      a. short definition of these associations
             Yes. The principal relation the system currently uses is that of `similarity'
             of terms. `Similarity' is determined by different procedures in different
             contexts. For example, partial or `fuzzy' matching of terms is facilitated by
             noting whether terms share words or attested subphrases. For example, in
             vector-space modeling of documents, the contained words of all terms (in the
             document vector as well as the query vector) are broken out, giving, in
             effect, the possibility of matching parts of terms (though, technically, the
             individual words are realized as independent dimensions of the space). In
             addition, in nominating terms for inclusion in thesauri and in matching
             terms to thesauri, CLARIT processing takes account of contained words and
             attested subphrases.
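
The effect of breaking out contained words can be shown with a small sketch. It is an illustration under stated assumptions, not the CLARIT implementation: the function names are hypothetical, and raw counts stand in for the IDF/TF scores that the system assigns. Each multi-word term contributes one dimension for the whole term plus one independent dimension per contained word, so two different terms that share words still have a non-zero similarity.

```python
import math
from collections import Counter


def term_vector(terms):
    """Build a sparse vector with a dimension per term and per contained word.

    Raw counts are used here as a stand-in for IDF/TF weights (an assumption).
    """
    vec = Counter()
    for term in terms:
        vec[term] += 1.0
        words = term.split()
        if len(words) > 1:          # single-word terms are already a word dimension
            for word in words:
                vec[word] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_terms = ["information retrieval system"]
query_terms = ["retrieval system"]

# With contained words broken out, the shared words give a partial match.
partial = cosine(term_vector(doc_terms), term_vector(query_terms))

# With whole terms only, the two different phrases do not match at all.
whole_only = cosine(Counter(doc_terms), Counter(query_terms))
print(partial, whole_only)
```

The whole-term comparison scores zero because the two phrases are distinct dimensions, while the broken-out version matches on the words they have in common.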
9. spelling checking (with manual correction) No
10. spelling correction No
11. proper noun identification algorithm
      Yes/No. The system provides for identification of `candidate proper nouns' based
      on morphological analysis. (Essentially, since the morphological analysis is virtually
      exhaustive for English, words that cannot be mapped to specific lexical items are
      given the provisional label "cpn"--'candidate proper noun'--and parsing proceeds
      accordingly.) There is a facility in CLARIT for highly-reliable proper name
      (including acronym) identification, but it was not used in this round of TREC
      processing.
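
The fallback labelling described above can be sketched as follows. The tiny lexicon, the tag names other than "cpn", and the function names are hypothetical, and a plain dictionary lookup stands in for real morphological analysis (which would also handle inflected forms). The point is only that a word with no lexical entry receives the provisional tag "cpn" rather than stopping the analysis.

```python
# Hypothetical miniature lexicon mapping word forms to candidate category tags.
LEXICON = {
    "the": {"DET"},
    "system": {"NOUN"},
    "indexes": {"VERB", "NOUN"},
    "abstracts": {"NOUN", "VERB"},
    "from": {"PREP"},
}


def analyze(word):
    """Return the candidate tags for a word, or {"cpn"} if it is not in the lexicon."""
    tags = LEXICON.get(word.lower())
    return tags if tags is not None else {"cpn"}


def tag_text(text):
    """Tag each whitespace-separated word; unknown words become candidate proper nouns."""
    return [(word, analyze(word)) for word in text.split()]


tagged = tag_text("The system indexes abstracts from Medline")
print(tagged)
```

Because the label is only provisional, a later stage (such as the proper name facility mentioned above) could confirm or reject each candidate.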
12. tokenizer (recognizes dates, phone numbers, common patterns)
      a. which patterns are tokenized?
             Certain common abbreviations are included in the lexicon and, under
             morphological processing, are rendered into normalized forms. The system
             can utilize--and even partially discover--supplemental lexicons of
             domain-specific abbreviations and other phrasal-lexical patterns, but this
             facility was not used for TREC processing.

