SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Knowledge-Based Searching with TOPIC
chapter
J. Lehman
C. Reid
National Institute of Standards and Technology
D. K. Harman
aggregated by boolean and evidential reasoning operators
and point value uncertainty at the term level (each piece
of evidence has a strength/uncertainty attached to its
predictability of its parent concept).
Topic provides search rule management functions to
support the creation, repeated use, modification, sharing
and display of one or more libraries of related search
rules. The search rule libraries are themselves searchable,
including text annotations of the rules. Search rules serve
as interactive queries, as automatic queries, and as a
training mechanism for the installation's domains.
A search rule definition may include several thousand
pieces of evidence in over one hundred levels of detail.
One search rule library may contain twenty thousand
rules. Search rules (topics) are named, and a reference to
the name in a search expression inherits all lower levels
of evidence. Any query which includes a search rule
name will automatically receive the full definition of the
rule in the search. The lowest level of evidence is the
text expression. Search rules may be composed of other
named search rules.
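
The name-reference behavior described above can be illustrated with a minimal sketch. The library layout, the rule names, and the expand function are hypothetical and are not Topic's actual storage format; the sketch only shows how a reference to a rule name could inherit every lower level of evidence down to the text expressions.

```python
from typing import Dict, List, Tuple

# Hypothetical library: rule name -> (operator, [(weight, child), ...]).
# A child that is itself a key of the library is a reference to another
# named rule; any other child is a text expression, the lowest level of
# evidence.
Library = Dict[str, Tuple[str, List[Tuple[float, str]]]]

LIBRARY: Library = {
    "space-program": ("accrue", [(0.8, "launch-vehicle"), (0.5, "orbit")]),
    "launch-vehicle": ("or", [(0.9, "rocket"), (0.7, "booster")]),
}


def expand(name: str, library: Library,
           _seen: frozenset = frozenset()) -> List[str]:
    """Return every text expression inherited by a reference to `name`."""
    if name in _seen:
        raise ValueError(f"cyclic rule reference: {name}")
    if name not in library:
        return [name]  # a text expression, not a rule name
    _operator, children = library[name]
    terms: List[str] = []
    for _weight, child in children:
        terms.extend(expand(child, library, _seen | {name}))
    return terms
```

Under these assumptions, expand("space-program", LIBRARY) walks through the nested "launch-vehicle" rule and collects its text expressions along with "orbit". The cycle check is a design choice of the sketch; the source does not say how Topic treats circular references.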
Search rules appear as an alphabetical list of topic
names, an indented outline showing the levels of rules,
or a graphical "family tree" display of rules and their
parents/children, including evidence combination
operators and evidence "weights". Searches may be
executed directly from any node (name) in the search rule
family. A topic search rule graphic display example
appears in Figure 1.
The search rule syntax consists of an exact or fuzzy
match (pattern match) capability for individual terms
(case sensitive); a boolean combination (and (all), or
(any), not) of terms; dual-direction, nested, grammatical
(paragraph, sentence, phrase) proximity operators; a
relative (fuzzy) proximity operator for two or more
terms; an evidence aggregation operator (accrue) for both
full-text and structured field data; and inexact match
techniques as follows:
1. wildcard expressions for term expansion; single
character, character group, or character class
2. soundex (first letter common) expressions for
phonetic term expansion
3. source language-specific stemming (morphological
variants) expressions for term expansion
4. typographical expressions for term expansion (n-
character infidelity to search term)
5. multi-direction thesaurus (user-modifiable) for term
expansion
6. suggestion (statistical correlation) for term
expansion
7. evidence appearing in a field value, or as the field
value (contains, matches, substring, starts, ends).
Each of the above inexact match techniques may be
executed automatically. Negative evidence may be
applied on a term-by-term basis with any operator. The
structured field data types are character, number and date.
Date arithmetic is provided, as well as relative date
expressions such as "yesterday", "today", etc.
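
Three of the numbered inexact match techniques above (wildcard, soundex, and typographical expansion) can be sketched as follows. This is an illustration only: the expand_term function and its mode names are hypothetical, the soundex routine is the standard American Soundex rather than whatever variant Topic uses, and the "n-character infidelity" test is approximated here with Levenshtein edit distance, which is an assumption.

```python
import fnmatch
from typing import List

# Standard American Soundex digit groups.
_CODES = {ch: digit
          for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                 "4": "l", "5": "mn", "6": "r"}.items()
          for ch in letters}


def soundex(word: str) -> str:
    """Four-character Soundex key; the first letter is kept as-is."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    digits = []
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (word[0].upper() + "".join(digits) + "000")[:4]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def expand_term(term: str, vocabulary: List[str], mode: str,
                max_errors: int = 1) -> List[str]:
    """Expand `term` against `vocabulary` using one inexact technique."""
    if mode == "wildcard":  # case-sensitive; supports *, ?, and [...]
        return [w for w in vocabulary if fnmatch.fnmatchcase(w, term)]
    if mode == "soundex":
        key = soundex(term)
        return [w for w in vocabulary if soundex(w) == key]
    if mode == "typo":
        return [w for w in vocabulary
                if edit_distance(w, term) <= max_errors]
    raise ValueError(f"unknown mode: {mode}")
```

Each mode returns the vocabulary entries that would be substituted for the original term in the search. Stemming, thesaurus, and suggestion expansion depend on language resources and collection statistics and are omitted from the sketch.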
2.1.3 SEARCH RESULT RANKING
Results of searches are relevance-ranked lists of
documents, with displayed titles or other descriptive
information. The numeric score, and the accompanying
rank, result from a best-fit comparison of the full-text
document and descriptor content against the search rule
evidence. The ranking is subject to an optional
threshold, used primarily to limit output, but the
threshold may be used to describe search recall and
precision. The relevance threshold is always used in
dissemination/notification.
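
A minimal sketch of ranking with an optional threshold follows. The rank function, the document identifiers, and the assumption that scores fall between 0 and 1 are illustrative and are not taken from Topic itself.

```python
from typing import Dict, List, Tuple


def rank(scores: Dict[str, float],
         threshold: float = 0.0) -> List[Tuple[int, str, float]]:
    """Return (rank, document id, score) rows for scores >= threshold."""
    hits = [(doc, score) for doc, score in scores.items()
            if score >= threshold]
    # Highest score first; ties are broken by document id for stable output.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return [(position, doc, score)
            for position, (doc, score) in enumerate(hits, 1)]
```

Raising the threshold shortens the list and tends to favor precision; lowering it admits weaker matches and tends to favor recall.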
Evidence consists of terms, operators (syntax) and the
numeric strength of the relationship between the
evidence and its (next higher level) search rule. The
evidence may be aggregated or evaluated with boolean
operators. Aggregation involves giving relevance score
credit for each piece of evidence found (breadth of
evidence first). As each level is evaluated in a search rule
(tree), potential document score modification occurs
(since successive levels may be weighted evidence for
their next broader concept). The score of an individual
term may include a frequency-of-occurrence factor (a
normalized concentration factor), which is a less powerful
scoring factor than the absolute presence of the evidence
in the document. A document score explanation function is
included.
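
The level-by-level evaluation described above can be sketched as a recursive walk over a rule tree. The source does not give Topic's scoring formulas, so every numeric choice below is an assumption made for illustration: scores lie between 0 and 1, "and" takes the minimum of its weighted children, "or" takes the maximum, "accrue" combines them as a probabilistic sum so that each additional piece of evidence adds credit, and presence of a term contributes a fixed base (PRESENCE) with term concentration adding a smaller bonus.

```python
from math import prod
from typing import List, Tuple, Union

# A node is a text expression (str) or (operator, [(weight, child), ...]).
Node = Union[str, Tuple[str, List[Tuple[float, "Node"]]]]

PRESENCE = 0.75  # assumed credit for the mere presence of a term


def term_score(term: str, tokens: List[str]) -> float:
    """Presence dominates; concentration adds a smaller bonus."""
    count = tokens.count(term)
    if count == 0:
        return 0.0
    concentration = count / len(tokens)  # normalized frequency
    return PRESENCE + (1.0 - PRESENCE) * concentration


def evaluate(node: Node, tokens: List[str]) -> float:
    """Score one document (as a token list) against a rule tree."""
    if isinstance(node, str):
        return term_score(node, tokens)
    operator, children = node
    # Each child's score is scaled by the strength of its relationship
    # to this (next higher level) rule.
    weighted = [weight * evaluate(child, tokens)
                for weight, child in children]
    if not weighted:
        return 0.0
    if operator == "and":
        return min(weighted)
    if operator == "or":
        return max(weighted)
    if operator == "accrue":
        return 1.0 - prod(1.0 - score for score in weighted)
    raise ValueError(f"unknown operator: {operator}")
```

For example, with the hypothetical rule ("accrue", [(0.8, ("or", [(0.9, "rocket"), (0.7, "shuttle")])), (0.5, "orbit")]) and a document that mentions both "rocket" and "orbit", the accrue node scores higher than either branch alone, which is the breadth-of-evidence behavior the text describes. Negative evidence and proximity operators are left out of the sketch.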
2.2 AGGREGATE SEARCH FUNCTIONS
Searches may iterate on the results of the previous
search.
Any search may be named/saved along with its results
manipulation criteria (sorting by fields, grouping) for
later execution. Any search criteria may be interactively
defined as a logical view of the collection, which then
provides many alternative search universes for the user
population. All Topic activities are audited. A search
which supports discretionary access control may be
transparently appended to any user's search.
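
Two of the functions above, iterating on previous results and transparently appending an access control search, can be sketched with simple predicates. The document fields ("text", "level"), the with_access_control helper, and the use of Python callables to stand in for searches are all hypothetical; the source does not describe how Topic implements either feature.

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]
Predicate = Callable[[Doc], bool]


def search(collection: List[Doc], query: Predicate) -> List[Doc]:
    """Return the documents in `collection` that satisfy `query`."""
    return [doc for doc in collection if query(doc)]


def with_access_control(user_query: Predicate,
                        access_rule: Predicate) -> Predicate:
    """Combine the user's query with an access rule the user never sees."""
    return lambda doc: access_rule(doc) and user_query(doc)


# Hypothetical collection with a structured access field.
DOCS: List[Doc] = [
    {"id": "d1", "text": "rocket launch", "level": "public"},
    {"id": "d2", "text": "rocket design", "level": "restricted"},
    {"id": "d3", "text": "orbit data", "level": "public"},
]

user_query: Predicate = lambda doc: "rocket" in doc["text"]
public_only: Predicate = lambda doc: doc["level"] == "public"

# The access rule is appended before the search runs.
first_pass = search(DOCS, with_access_control(user_query, public_only))

# A follow-up search iterates on the previous results only.
second_pass = search(first_pass, lambda doc: "launch" in doc["text"])
```

Because the second search runs over first_pass rather than the whole collection, it can never return a document the access rule excluded.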