NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Natural Language Processing in Large-Scale Text Retrieval Tasks
T. Strzalkowski
National Institute of Standards and Technology
Donna K. Harman
limited scale.
The rapid progress in Computational Linguis-
tics over the last few years has changed this equation
in various ways. First of all, large-scale resources
became available: on-line lexicons, including Oxford
Advanced Learner's Dictionary (OALD), Longman
Dictionary of Contemporary English (LDOCE),
Webster's Dictionary, Oxford English Dictionary,
Collins Dictionary, and others, as well as large text
corpora, many of which can now be obtained for
research purposes. Robust text-oriented software
tools have been built, including part-of-speech
taggers (stochastic and otherwise), and fast parsers
capable of processing text at speeds of 2600 words
per minute or more (e.g., TTP parser developed by
the author). While many of the fast parsers are not
very accurate (they are usually partial analyzers by
design),2 some, like TTP, in fact perform no worse
than standard full-analysis parsers which are many
times slower and far less robust. 3
An accurate syntactic analysis is an essential
prerequisite for term selection, but it is by no means
sufficient. Syntactic parsing of the database contents
is usually attempted in order to extract linguistically
motivated phrases, which presumably are better indicators of content than "statistical phrases", where
words are grouped solely on the basis of physical
proximity (e.g., "college junior" is not the same as
"junior college"). However, the creation of such compound terms makes the term-matching process more
complex since in addition to the usual problems of
synonymy and subsumption, one must deal with their
structure (e.g., "college junior" is the same as "junior
in college"). In order to deal with structure, the parser's
output needs to be "normalized" or "regularized" so
that complex terms with the same or closely related
meanings would indeed receive matching representa-
tions. This goal has been achieved to a certain extent
in the present work. As will be discussed in more detail below, indexing terms were selected from among head-modifier pairs extracted from predicate-argument representations of sentences.
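The effect of such normalization can be sketched in a few lines of code. The sketch below is purely illustrative and is not the system's actual implementation; the surface-pattern rules and the (head, modifier) ordering are assumptions made for the example:

```python
# Illustrative sketch (not the actual indexing code): reduce
# syntactically different but equivalent phrases to the same
# head-modifier pair, so "college junior" matches "junior in college"
# while remaining distinct from "junior college".

def head_modifier_pair(phrase):
    """Normalize two hypothetical phrase patterns to (head, modifier):
    "X Y"    -> (Y, X)   noun-noun compound: X modifies the head Y
    "Y in X" -> (Y, X)   prepositional paraphrase of the same pair
    """
    words = phrase.lower().split()
    if len(words) == 3 and words[1] == "in":
        return (words[0], words[2])   # "junior in college" -> ("junior", "college")
    if len(words) == 2:
        return (words[1], words[0])   # "college junior" -> ("junior", "college")
    raise ValueError("pattern not covered by this sketch")

# The two paraphrases receive matching representations:
assert head_modifier_pair("college junior") == head_modifier_pair("junior in college")
# ... while the structurally different compound does not:
assert head_modifier_pair("junior college") != head_modifier_pair("college junior")
```

In the actual system such pairs are derived from the parser's predicate-argument structures rather than from surface word patterns as above.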
Standard IR benchmark collections are statistically too
small and the experiments can easily produce counterintuitive
results. For example, the Cranfield collection contains only approx. 180,000
English words, while the CACM-3204 collection contains approx. 200,000
words.
Partial parsing is usually fast enough, but it also generates
noisy data: as many as 50% of all generated phrases could be in-
correct (Lewis and Croft, 1990).
3 TTP has been shown to produce parse structures which are
no worse in recall, precision and crossing rate than those generated
by full-scale linguistic parsers when compared to hand-coded
Treebank parse trees.
Introduction of compound terms also compli-
cates the task of discovery of various semantic rela-
tionships among them, including synonymy and sub-
sumption. For example, the term natural language
can be considered, in certain domains at least, to sub-
sume any term denoting a specific human language,
such as English. Therefore, a query containing the
former may be expected to retrieve documents containing the latter. The same can be said about language and English, unless language is in fact part of the compound term programming language, in
which case the association language - Fortran is
appropriate. This is a problem because (a) it is a stan-
dard practice to include both simple and compound
terms in document representation, and (b) term asso-
ciations have thus far been computed primarily on
word level (including fixed phrases) and therefore
care must be taken when such associations are used
in term matching. This may prove particularly trou-
blesome for systems that attempt term clustering in
order to create "meta-terms" to be used in document
representation.
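The interaction between compound terms and word-level associations can be illustrated with a small sketch; the association table, the compound list, and the blocking rule below are assumptions made for the example, not the system's actual data or logic:

```python
# Illustrative sketch: a word-level subsumption association
# (language -> English) is applied only when the word is not
# the head of a known compound term such as "programming language".

SUBSUMES = {"language": {"english", "french"}}   # hypothetical associations
COMPOUNDS = {("programming", "language")}        # compounds that block them

def expandable(i, terms):
    """True if terms[i] may be expanded via its word-level associations,
    i.e. it is not the second element of a known compound term."""
    if i > 0 and (terms[i - 1], terms[i]) in COMPOUNDS:
        return False
    return terms[i] in SUBSUMES

assert expandable(1, ["human", "language"])            # may add "english"
assert not expandable(1, ["programming", "language"])  # blocked by the compound
```

A system clustering terms into "meta-terms" would need a check of this kind at every point where a word-level association is consulted.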
The system presented here computes term
associations from text on word and fixed phrase level
and then uses these associations in query expansion.
A fairly primitive filter is employed to separate
synonymy and subsumption relationships from others
including antonymy and complementation, some of
which are strongly domain-dependent. This process
has led to increased retrieval precision in experiments with smaller and more cohesive collections
(CACM-3204), but may be less effective with large
databases. We are presently studying more advanced
clustering methods along with the changes in
interpretation of resulting associations, as signalled in
the previous paragraph.
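As a rough sketch of computing word-level associations and using them in query expansion: the toy corpus, the Dice-coefficient measure, and the threshold below are assumptions for the example; the system's actual measure and filter are more elaborate.

```python
# Illustrative sketch: word-level term associations from document
# co-occurrence, used for query expansion. The corpus, the Dice
# coefficient, and the 0.6 threshold are assumptions for the example.
from collections import Counter
from itertools import combinations

docs = [
    ["rail", "strike", "union"],
    ["rail", "strike", "walkout"],
    ["union", "walkout", "strike"],
]

term_counts = Counter()
pair_counts = Counter()
for doc in docs:
    terms = set(doc)
    term_counts.update(terms)
    pair_counts.update(combinations(sorted(terms), 2))

def associated(a, b, threshold=0.6):
    """Crude association filter: Dice coefficient over co-occurrence."""
    pair = tuple(sorted((a, b)))
    dice = 2 * pair_counts[pair] / (term_counts[a] + term_counts[b])
    return dice >= threshold

def expand(query):
    """Add every vocabulary term strongly associated with a query term."""
    extra = {b for a in query for b in term_counts
             if b not in query and associated(a, b)}
    return list(query) + sorted(extra)

assert expand(["walkout"]) == ["walkout", "strike"]
```

A symmetric measure like this cannot by itself separate synonymy and subsumption from antonymy or complementation, which is why an additional filter of the kind described above is needed.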
Working with TREC topics has also helped to
identify other, perhaps unanticipated problems, some
of which may render the traditional statistical
approach to information retrieval quite unworkable.
A typical document retrieval query will specify (in
one way or another) a set of concepts that are of
interest to the originator of the query. Thus if the user
is interested in documents that report on anticipated
rail strikes, the following query may be appropriate:
(TREC topic 058) A relevant document will report an
impending rail strike .... Any other wording can be
used so long as it denotes a concept to be
found in a document. The system's task is then to dis-
cover that the same concept is being denoted in both
the query and a document, no matter how different
the surface descriptions happen to be. In other
words, no new information is requested, the query
being entirely self contained and completely