NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
D. K. Harman, ed., National Institute of Standards and Technology

Natural Language Processing in Large-Scale Text Retrieval Tasks
T. Strzalkowski

limited scale.[1] The rapid progress in Computational Linguistics over the last few years has changed this equation in various ways. First of all, large-scale resources have become available: on-line lexicons, including the Oxford Advanced Learner's Dictionary (OALD), the Longman Dictionary of Contemporary English (LDOCE), Webster's Dictionary, the Oxford English Dictionary, the Collins Dictionary, and others, as well as large text corpora, many of which can now be obtained for research purposes. Robust text-oriented software tools have been built, including part-of-speech taggers (stochastic and otherwise), and fast parsers capable of processing text at speeds of 2,600 words per minute or more (e.g., the TTP parser developed by the author). While many of the fast parsers are not very accurate (they are usually partial analyzers by design),[2] some, like TTP, in fact perform no worse than standard full-analysis parsers, which are many times slower and far less robust.[3]

An accurate syntactic analysis is an essential prerequisite for term selection, but it is by no means sufficient. Syntactic parsing of the database contents is usually attempted in order to extract linguistically motivated phrases, which presumably are better indicators of content than "statistical phrases", where words are grouped solely on the basis of physical proximity (e.g., "college junior" is not the same as "junior college"). However, the creation of such compound terms makes the term matching process more complex, since in addition to the usual problems of synonymy and subsumption, one must deal with their structure (e.g., "college junior" is the same as "junior in college").

In order to deal with structure, the parser's output needs to be "normalized" or "regularized" so that complex terms with the same or closely related meanings would indeed receive matching representations. This goal has been achieved to a certain extent in the present work. As will be discussed in more detail below, indexing terms were selected from among head-modifier pairs extracted from predicate-argument representations of sentences.

----------
[1] Standard IR benchmark collections are statistically too small, and experiments can easily produce counterintuitive results. For example, the Cranfield collection is only approx. 180,000 English words, while the CACM-3204 collection is approx. 200,000 words.
[2] Partial parsing is usually fast enough, but it also generates noisy data: as many as 50% of all generated phrases could be incorrect (Lewis and Croft, 1990).
[3] TTP has been shown to produce parse structures that are no worse in recall, precision, and crossing rate than those generated by full-scale linguistic parsers when compared to hand-coded Treebank parse trees.
----------

The introduction of compound terms also complicates the task of discovering various semantic relationships among them, including synonymy and subsumption. For example, the term natural language can be considered, in certain domains at least, to subsume any term denoting a specific human language, such as English. Therefore, a query containing the former may be expected to retrieve documents containing the latter. The same can be said about language and English, unless language is in fact a part of the compound term programming language, in which case the association of language with Fortran is the appropriate one.
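To make the normalization idea concrete, the following minimal sketch (hypothetical Python, not the actual TTP or indexing code) shows how reducing phrases to head-modifier pairs lets syntactic variants of the same concept receive matching representations while keeping distinct concepts apart:

```python
# Hypothetical sketch: reducing syntactic variants to normalized
# head-modifier pairs, so that "college junior" and "junior in college"
# receive the same indexing term, while "junior college" does not.
# The pair representations below are illustrative, not TTP's output format.

def head_modifier_pair(head, modifier):
    """A normalized indexing term: (head, modifier), independent
    of the surface word order of the original phrase."""
    return (head, modifier)

# "college junior"   -> noun compound: the rightmost noun is the head
pair_a = head_modifier_pair(head="junior", modifier="college")

# "junior in college" -> prepositional paraphrase: same head and modifier
pair_b = head_modifier_pair(head="junior", modifier="college")

# "junior college"   -> here "college" is the head, "junior" the modifier
pair_c = head_modifier_pair(head="college", modifier="junior")

assert pair_a == pair_b   # variants of one concept match
assert pair_a != pair_c   # a distinct concept stays distinct
```

In the system described here, of course, such pairs are not supplied by hand but extracted from the predicate-argument representations produced by the parser.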
This is a problem because (a) it is standard practice to include both simple and compound terms in document representation, and (b) term associations have thus far been computed primarily at the word level (including fixed phrases), and therefore care must be taken when such associations are used in term matching. This may prove particularly troublesome for systems that attempt term clustering in order to create "meta-terms" to be used in document representation.

The system presented here computes term associations from text at the word and fixed-phrase level and then uses these associations in query expansion (a sketch appears at the end of this section). A fairly primitive filter is employed to separate synonymy and subsumption relationships from others, including antonymy and complementation, some of which are strongly domain-dependent. This process has led to increased retrieval precision in experiments with smaller and more cohesive collections (CACM-3204), but may be less effective with large databases. We are presently studying more advanced clustering methods, along with changes in the interpretation of the resulting associations, as signalled in the previous paragraph.

Working with TREC topics has also helped to identify other, perhaps unanticipated, problems, some of which may render the traditional statistical approach to information retrieval quite unworkable. A typical document retrieval query will specify (in one way or another) a set of concepts that are of interest to the originator of the query. Thus, if the user is interested in documents that report on anticipated rail strikes, the following query may be appropriate (TREC topic 058):

    A relevant document will report an impending rail strike ....

Any other wording can be used, so long as it denotes a concept to be found in a document. The system's task is then to discover that the same concept is being denoted in both the query and a document, no matter how different the surface descriptions happen to be. In other words, no new information is requested, the query being entirely self-contained and completely
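As promised above, here is a minimal sketch of association-based query expansion. It is hypothetical code: raw within-window co-occurrence counts stand in for the system's actual association statistics, and the "filter" is only a frequency threshold, not the synonymy/subsumption filter described earlier; all names, thresholds, and example documents are invented for illustration.

```python
# Hypothetical sketch of query expansion via word-level term associations.
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "is", "in"}

def association_scores(documents, window=5):
    """Count within-window co-occurrences as a crude association measure."""
    counts = defaultdict(int)
    for doc in documents:
        words = [w for w in doc.lower().split() if w not in STOPWORDS]
        for i, w in enumerate(words):
            for v in words[i + 1 : i + 1 + window]:
                if v != w:
                    counts[tuple(sorted((w, v)))] += 1
    return counts

def expand_query(query_terms, counts, threshold=2):
    """Add terms strongly associated with any query term. A real filter
    would also reject antonyms, complements, and other domain-dependent
    associations; this one only thresholds raw counts."""
    expanded = set(query_terms)
    for (w, v), c in counts.items():
        if c >= threshold:
            if w in query_terms:
                expanded.add(v)
            elif v in query_terms:
                expanded.add(w)
    return expanded

docs = ["natural language processing of english text",
        "english is a natural language",
        "retrieval of english language documents"]
counts = association_scores(docs)
print(expand_query({"language"}, counts))
# -> {'language', 'english', 'natural'} on this toy corpus
```

Even this toy example shows why the filtering step matters: on a larger corpus, frequency alone would pull in associations (such as antonyms) that hurt rather than help precision.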