System Summary and Timing

Carnegie Mellon University

General Comments

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.


I.      Construction of indices, knowledge bases, and other data structures (please describe all data structures that
        your system needs for searching)

        A. Which of the following were used to build your data structures?
               1. stopword list
                     No. But the NLP/morphological-analysis components of the system do limit the
                     possible lexical categories of some English words to eliminate useless ambiguities.
                     For example, "but" is given lexical category "cnj" (conjunction) and not alternative,
                     possible categories, such as "sn" (singular.noun); "can" is limited to category
                     "auxm" (modal.auxiliary.verb) and not "sn"; etc. Such selective restrictions have
                     some of the effects of "stop-word" lists, since spurious (or irrelevant) categories will
                     not enter into later indexing stages.
                     Furthermore, the NLP/parsing components of the system return simplex noun
                     phrases (NPs) as candidate terms in which some components of the NP are
                     eliminated, such as quantifiers (e.g., "many", "one", etc.), determiners (e.g., "the",
                     "a", etc.), and conjunctions (e.g., "and", "or", etc.). In addition, in normal CLARIT
                     NP processing, the parser does not return prepositions, non-NP adverbs, and
                     extra-NP elements. This practice, therefore, also has the effect of eliminating items
                     that normally appear on "stop-word" lists. It clearly goes beyond that practice in
                     eliminating all extra-NP words as well.
                     a. how many words in list?
                            Approximately 100 lexical items have been given restrictive syntactic
                            treatment, in addition to the words with unambiguously empty categories.
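The two filtering effects described in item 1 can be illustrated with a minimal sketch. It is not CLARIT's implementation: the function names, the tiny restricted lexicon, and the category labels "det" and "quant" are hypothetical, and only "cnj", "auxm", and "sn" come from the answer above.

```python
# Illustrative sketch only; all names and the "det"/"quant" labels are assumptions.

# Words forced to a single lexical category (examples taken from the text).
RESTRICTED_CATEGORIES = {"but": "cnj", "can": "auxm"}

# Categories removed from inside a simplex noun phrase (labels are hypothetical).
DROPPED_NP_CATEGORIES = {"det", "quant", "cnj"}


def assign_categories(word, candidate_categories):
    """Return the categories allowed for a word, applying any restriction."""
    restricted = RESTRICTED_CATEGORIES.get(word.lower())
    if restricted is not None:
        return [restricted]
    return list(candidate_categories)


def candidate_term(tagged_np):
    """Keep only the words of a tagged NP whose category is not dropped."""
    return [word for word, category in tagged_np
            if category not in DROPPED_NP_CATEGORIES]


if __name__ == "__main__":
    print(assign_categories("but", ["cnj", "sn"]))
    print(candidate_term([("the", "det"), ("many", "quant"),
                          ("information", "sn"), ("retrieval", "sn")]))
```

Under these assumptions, "but" keeps only the category "cnj", and the tagged phrase "the many information retrieval" reduces to the candidate term "information retrieval".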
               2. is a controlled vocabulary used? No
               3. stemming No
                     a. standard stemming algorithms
                            which ones?
                     b. morphological analysis
                            Yes. The Morph component of the system provides for comprehensive
                            inflectional.morphological analysis. In practice, the morph-normal form of
                            nouns and adjectives is used in the NP-based terms of the system.
                            Participles are not morphologically reduced (though it is possible to do so).
                            Derivational.morphological analysis is not used. A lexicon of approximately
                            [figure illegible] `root-form' items (English words) is the principal resource used by
                            Morph in addition to its morphological rule set.
               4. term weighting
                     Yes/No. The CLARIT process uses NLP to identify candidate terms en route to
                     indexing, development of associated resources (e.g., thesauri), and analysis of queries
                     or topics. These are taken as the `information units' of interest and are analyzed
                     statistically and heuristically. `Weights' are associated with terms at various stages
                     of processing. In indexing TREC documents, for example, an IDF-TF score was
                     associated with terms for each document. In the case of multi-word terms (the

