5000/cd years/nns ago/rb ,/com as/rb derived/vbn by/in the/di chinese/jj ancients/nns ./per

The tagger which we use to process the input text prior to parsing is based upon a bi-gram model; it selects the most likely tag for a word given co-occurrence probabilities computed from a relatively small training set.8 While the peak accuracy of the best-tag option of the tagger is predicted to approach 97% (Meteer et al., 1991), we noted that the actual performance on unprocessed WSJ text was in fact somewhat worse. The main problem, it appears, was frequent mistakes in the tokenization of input, especially in recognizing sentence boundaries. For example, when a sentence ended with a period but was not followed by at least two blanks or an end-of-line, this and the next sentence would be collapsed together. On the other hand, intra-sentential periods (like those following abbreviated words) were occasionally found followed by a new-line character, and the sentence was split into two. While the parser contains a provision to deal with the case of collapsed sentences, the tags were likely to be incorrect. The following example is typical; note the tagging errors at the second apostrophe and at "plans".

Gorbachev was running into trouble at home, including the August coup, "which I thought would be the end of it," Mr. Costa says. Still, plans to send the tank to the U.S. somehow moved ahead.

Gorbachev/np was/vbd running/vbg into/in trouble/nn at/in home/nn ,/com including/vbg the/di August/np coup/nn ,/com "/apos which/wdt I/pp thought/vbd would/md be/vb the/di end/nn of/in it/pp ,/com "/nn Mr/nn ./per Costa/np says/vbz ./per still/rb ,/com plans/vbz to/to send/vb the/di tank/nn to/to the/di U.S./np somehow/rb moved/vbd ahead/rb ./per

WORD SUFFIX TRIMMER

Word stemming has been an effective way of improving document recall since it reduces words to their common morphological root, thus allowing more successful matches.
On the other hand, stemming tends to decrease retrieval precision if care is not taken to prevent situations where otherwise unrelated words are reduced to the same stem. In our system we replaced a traditional morphological stemmer with a conservative dictionary-assisted suffix trimmer.9 The suffix trimmer performs essentially two tasks: (1) it reduces inflected word forms to their root forms as specified in the dictionary, and (2) it converts nominalized verb forms (e.g., "implementation", "storage") to the root forms of corresponding verbs (i.e., "implement", "store"). This is accomplished by removing a standard suffix, e.g., "stor+age", replacing it with a standard root ending ("+e"), and checking the newly created word against the dictionary, i.e., we check whether the new root ("store") is indeed a legal word, and whether the original word ("storage") is defined using the new root ("store") or one of its standard inflectional forms (e.g., "storing"). For example, the following definitions are excerpted from the Oxford Advanced Learner's Dictionary (OALD):

storage n [U] (space used for, money paid for) the storing of goods ...
diversion n [U] diverting ...
procession n [C] number of persons, vehicles, etc moving forward and following each other in an orderly way.

Therefore, we can reduce "diversion" to "divert" by removing the suffix "+sion" and adding the root form suffix "+t". On the other hand, "process+ion" is not reduced to "process".10

8 The program, supplied to us by Bolt Beranek and Newman, operates in two alternative modes, either selecting a single most likely tag for each word (best-tag option, the one we use at present), or supplying a short ranked list of alternatives (Meteer et al., 1991).

Earlier experiments with the CACM-3204 collection showed an improvement in retrieval precision of 6% to 8% over the base system equipped with a standard morphological stemmer (the SMART stemmer).
Due to time limitations these numbers are not available for the TREC database at this time.

HEAD-MODIFIER STRUCTURES

Syntactic phrases extracted from TTP parse trees are head-modifier pairs. The head in such a pair is a central element of a phrase (main verb, main noun, etc.), while the modifier is one of the adjunct arguments of the head. In the TREC experiments reported here we extracted head-modifier word and fixed-phrase pairs only. While the TREC WSJ database is large enough to warrant generation of larger compounds, we were in no position to verify their effectiveness in indexing (largely because of the tight schedule). We discuss some options below.

9 Dealing with prefixes is a more complicated matter, since they may have quite a strong effect upon the meaning of the resulting term, e.g., "un-" usually introduces explicit negation.

10 Definition checking is not implemented yet.
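
The suffix-trimming procedure described under WORD SUFFIX TRIMMER can be illustrated with a minimal Python sketch. It is not the system's actual code: the names (DEFINITIONS, LEGAL_WORDS, SUFFIX_RULES, inflections, trim_suffix), the three-rule suffix table, and the crude inflection generator are all assumptions made for illustration, and the toy dictionary holds only the three OALD excerpts quoted above. The sketch includes the definition check as the text describes it, although footnote 10 notes that this check was not yet implemented.

```python
import re

# Toy stand-in for the dictionary (an assumption): definitions are
# abbreviated from the OALD excerpts quoted in the text.
DEFINITIONS = {
    "storage": "(space used for, money paid for) the storing of goods",
    "diversion": "diverting",
    "procession": "number of persons, vehicles, etc moving forward and "
                  "following each other in an orderly way",
}

# Words treated as legal dictionary entries in this sketch (hypothetical list).
LEGAL_WORDS = set(DEFINITIONS) | {"store", "divert", "process"}

# Hypothetical rule table: (suffix to remove, root ending to add).
SUFFIX_RULES = [("sion", "t"), ("age", "e"), ("ion", "")]


def inflections(root):
    """Return the root plus a crude set of its standard inflected forms."""
    forms = {root, root + "s", root + "ed", root + "ing"}
    if root.endswith("e"):
        # e.g. store -> stored, storing
        forms |= {root + "d", root[:-1] + "ing"}
    return forms


def trim_suffix(word):
    """Reduce a nominalization to its verb root if the dictionary supports it."""
    definition = DEFINITIONS.get(word, "")
    tokens = set(re.findall(r"[a-z]+", definition.lower()))
    for suffix, ending in SUFFIX_RULES:
        if not word.endswith(suffix):
            continue
        candidate = word[:-len(suffix)] + ending
        # Accept only if the candidate is a legal word AND the original word
        # is defined using the candidate or one of its inflected forms.
        if candidate in LEGAL_WORDS and tokens & inflections(candidate):
            return candidate
    return word  # conservative default: leave the word unchanged


for w in ("storage", "diversion", "procession"):
    print(w, "->", trim_suffix(w))
```

Under these assumptions "storage" and "diversion" are reduced to "store" and "divert", while "procession" is left unchanged: "process" is a legal word, but the definition of "procession" mentions neither it nor any of its inflected forms.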