NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Knowledge-Based Searching with TOPIC
J. Lehman, C. Reid
National Institute of Standards and Technology, D. K. Harman

[...] aggregated by Boolean and evidential reasoning operators and point-value uncertainty at the term level (each piece of evidence carries a strength/uncertainty expressing how well it predicts its parent concept).

Topic provides search rule management functions to support the creation, repeated use, modification, sharing and display of one or more libraries of related search rules. The search rule libraries are themselves searchable, including the text annotations of the rules. Search rules serve as interactive queries, automatic queries, and a training mechanism for the installation's domains. A search rule definition may include several thousand pieces of evidence in over one hundred levels of detail. One search rule library may contain twenty thousand rules.

Search rules (topics) are named, and a reference to the name in a search expression inherits all lower levels of evidence: any query that includes a search rule name automatically receives the full definition of the rule in the search. The lowest level of evidence is the text expression. Search rules may be composed of other named search rules.

Search rules may be displayed as an alphabetical list of topic names, as an indented outline showing the levels of rules, or as a graphical "family tree" of rules and their parents/children, including evidence combination operators and evidence "weights". Searches may be executed directly from any node (name) in the search rule family. An example of a topic search rule graphic display appears in Figure 1.
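The way a rule name pulls in all lower levels of evidence can be illustrated with a small sketch. This is not Topic's rule format or API: the `Rule` class, the `expand` function, the operator labels, the weights and the sample rule names are all hypothetical, chosen only to show named rules composed of other named rules, with text expressions as the lowest level.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """A named search rule: an operator applied to weighted children (hypothetical layout)."""
    name: str
    operator: str = "accrue"
    # Each child is (weight, name), where name is another rule or a text expression.
    children: list = field(default_factory=list)


def expand(name, library, depth=0):
    """Return an indented outline of a rule and every lower level of evidence."""
    rule = library.get(name)
    if rule is None:
        # Not a rule name, so treat it as a text expression (lowest level).
        return ["  " * depth + repr(name)]
    lines = ["  " * depth + f"{rule.name} <{rule.operator}>"]
    for weight, child in rule.children:
        sub = expand(child, library, depth + 1)
        sub[0] = f"{sub[0]}  (weight {weight})"
        lines.extend(sub)
    return lines


# Illustrative library: "retrieval" refers to the named rule "search-terms".
library = {
    "retrieval": Rule("retrieval", "accrue", [(0.8, "search-terms"), (0.5, "ranking")]),
    "search-terms": Rule("search-terms", "or", [(0.9, "query"), (0.7, "index")]),
}

print("\n".join(expand("retrieval", library)))
```

The printed outline corresponds to the indented display described above. The sketch assumes the library contains no circular references; a real rule manager would need to detect them.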
The search rule syntax consists of: an exact or fuzzy (pattern) match capability for individual terms (case sensitive); Boolean combinations (and (all), or (any), not) of terms; dual-direction, nested, grammatical (paragraph, sentence, phrase) proximity operators; a relative (fuzzy) proximity operator for two or more terms; an evidence aggregation operator (accrue) for both full-text and structured field data; and inexact match techniques as follows:

1. wildcard expressions for term expansion: single character, character group, or character class
2. soundex (first letter common) expressions for morphological term expansion
3. source language-specific stemming (morphological variants) expressions for term expansion
4. typographical expressions for term expansion (n-character infidelity to search term)
5. multi-direction thesaurus (user-modifiable) for term expansion
6. suggestion (statistical correlation) for term expansion
7. evidence appearing in a field value, or as the field value (contains, matches, substring, starts, ends).

Each of the above inexact match techniques may be executed automatically. Negative evidence may be applied on a term-by-term basis with any operator. The structured field data types are character, number and date. Date arithmetic is provided, as well as relative date expressions such as "yesterday", "today", etc.

2.1.3 SEARCH RESULT RANKING

Results of searches are relevance-ranked lists of documents, with displayed titles or other descriptive information. The numeric score, and the accompanying rank, are the result of a best-fit comparison of the full-text document and descriptor content with the search rule evidence. The ranking is subject to an optional threshold, used primarily to limit output, but the threshold may also be used to describe search recall and precision. The relevance threshold is always used in dissemination/notification.
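As a minimal sketch of the ranking step described above, the following takes document scores that have already been computed and produces a ranked list, optionally cut at a threshold. The function name `rank`, the document identifiers and the score values are illustrative assumptions; the text does not specify how Topic represents scores or thresholds internally.

```python
def rank(scores, threshold=None):
    """Return (rank, doc_id, score) tuples, best score first.

    `scores` maps a document id to its relevance score. If `threshold`
    is given, documents scoring below it are dropped from the output.
    """
    kept = [
        (doc, score)
        for doc, score in scores.items()
        if threshold is None or score >= threshold
    ]
    # Sort by descending score; break ties by document id for a stable order.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [(position + 1, doc, score) for position, (doc, score) in enumerate(kept)]


# Illustrative scores only.
scores = {"doc-a": 0.82, "doc-b": 0.35, "doc-c": 0.61}

for position, doc, score in rank(scores, threshold=0.5):
    print(position, doc, score)
```

Raising the threshold shortens the list and tends to favor precision, while lowering or omitting it tends to favor recall, which is one way to read the remark that the threshold may be used to describe recall and precision.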
Evidence consists of terms, operators (syntax) and the numeric strength of the relationship between the evidence and its (next higher level) search rule. The evidence may be aggregated or evaluated with Boolean operators. Aggregation involves giving relevance score credit for each piece of evidence found (breadth of evidence first). As each level of a search rule (tree) is evaluated, the document score may be modified, since successive levels may be weighted evidence for their next broader concept. The scoring of an individual term may include a frequency-of-occurrence factor (a normalized concentration factor), which is a less powerful scoring factor than the absolute presence of the evidence in the document. A document score explanation function is included.

2.2 AGGREGATE SEARCH FUNCTIONS

Searches may iterate on the results of the previous search. Any search may be named and saved along with its results manipulation criteria (sorting by fields, grouping) for later execution. Any search criteria may be interactively defined as a logical view of the collection, which then provides many alternative search universes for the user population. All Topic activities are audited. A search which supports discretionary access control may be transparently appended to any user's search.
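Returning to the evidence aggregation described in Section 2.1.3, the level-by-level scoring can be sketched as follows. The combination formulas here are assumptions made for illustration, not Topic's actual scoring: "and" is taken as the minimum, "or" as the maximum, and "accrue" as a probabilistic sum in which each piece of evidence found adds credit. The tuple-based tree layout, the `score` function and the sample rule are likewise hypothetical, and the frequency-of-occurrence factor is left out.

```python
from math import prod


def score(node, found_terms):
    """Score a rule tree node against the set of terms found in a document.

    A node is either a string (a term) or a tuple
    (operator, [(weight, child), ...]).
    """
    if isinstance(node, str):
        # Presence of the evidence only; no frequency factor in this sketch.
        return 1.0 if node in found_terms else 0.0
    operator, children = node
    # Each child's score is scaled by its strength as evidence for the parent.
    weighted = [weight * score(child, found_terms) for weight, child in children]
    if operator == "and":
        return min(weighted)
    if operator == "or":
        return max(weighted)
    if operator == "accrue":
        # Assumed formula: every piece of evidence found raises the score,
        # and the result stays below 1 unless some evidence scores exactly 1.
        return 1.0 - prod(1.0 - w for w in weighted)
    raise ValueError(f"unknown operator: {operator}")


# Illustrative two-level rule with made-up weights.
rule = ("accrue", [
    (0.8, ("or", [(0.9, "query"), (0.7, "index")])),
    (0.5, "ranking"),
])

print(score(rule, {"index", "ranking"}))
print(score(rule, {"index"}))
```

With these assumed weights, a document containing both "index" and "ranking" scores approximately 0.78, while one containing only "index" scores approximately 0.56, showing how additional evidence found at the same level accrues credit.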