Scientific Report No. IRS-13 Information Storage and Retrieval
Suffix Dictionaries
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
be stored in the system to word stems only. The suffix "s" dictionary is
applied in the same manner, but in this case the only "suffix" removed is the
terminal "s", with the object of conflating singular and plural word forms.
Many of the considerations relating to the methods of construction
of the stem dictionary have been discussed by Salton and Lesk [7]. The comments
made here relate to the extent to which the present dictionaries correctly
conflate English word forms and so use the correct stems.
The conflation of singular and plural words is not perfectly achieved
by terminal "s" removal, although over success is obtained in the case of
the Cran-1 aerodynamics terminology. The failures are due to well-known
singular and plural forms such as "body" and "bodies", "axis" and "axes",
"bureau" and "bureaux", "appendix" and "appendices", etc. Also, the terminal
"s" does not always denote a plural form, and words like "bluntness"
and "aerodynamics" have the "s" removed. This latter occurrence rarely affects
retrieval, however, since a request and document both containing the word
"bluntness" will match on the word without its terminal "s". It is possible
to imagine a case of incorrect conflation, for example, the word "axe" could
be incorrectly conflated with "axes", but such occurrences are extremely rare
within the narrow subject fields under test.
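The behaviour of terminal "s" removal described above can be illustrated with a
minimal sketch. The function names `strip_terminal_s` and `conflated` are
hypothetical and are not taken from the system itself; the word pairs are those
cited in the text.

```python
def strip_terminal_s(word: str) -> str:
    """Remove one terminal "s" if present; leave the word unchanged otherwise."""
    return word[:-1] if word.endswith("s") else word


def conflated(a: str, b: str) -> bool:
    """Two word forms are conflated when they reduce to the same string."""
    return strip_terminal_s(a) == strip_terminal_s(b)


# Irregular singular/plural pairs cited in the text: none is conflated.
irregular = [("body", "bodies"), ("axis", "axes"),
             ("bureau", "bureaux"), ("appendix", "appendices")]
missed = [pair for pair in irregular if not conflated(*pair)]

# A non-plural terminal "s" is still removed, but identically in the
# request and in the document, so the two reduced forms still match.
request_form = strip_terminal_s("bluntness")
document_form = strip_terminal_s("bluntness")

# The rare false conflation: "axe" and "axes" reduce to the same string.
false_match = conflated("axe", "axes")
```

Under these assumptions all four irregular pairs appear in `missed`, the two
reduced forms of "bluntness" are equal, and `false_match` is true.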
The full suffix removal procedure incorporates spelling rules which
correctly identify "bod" as the stem of both "body" and "bodies", and correctly
conflate "hope", "hoped" and "hoping", as well as "hop", "hopped" and "hopping".
There are some cases, however, where the correct stem is not recognized.
For example, the words "computation", "computations" and "computational" are
correctly conflated and given the same concept number by the look-up procedure,
but a second group of similar words, including "compute", "computed",
"computers", "computer", and "computing", is given a second concept number.
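The report does not list the spelling rules themselves. The following is only a
sketch of how rules of this kind could separate the "hope" group from the "hop"
group while giving "body" and "bodies" the stem "bod". The stem list, the suffix
list, the two rules (dropping a final "e" and doubling a final consonant) and
all names in the code are assumptions made for illustration, not the rules of
the actual procedure.

```python
VOWELS = set("aeiou")

# Hypothetical stem and suffix lists, for illustration only.
STEMS = ("bod", "hope", "hop")
SUFFIXES = ("", "s", "y", "ies", "ed", "ing")
# Assumption: only these suffixes trigger the two spelling rules below.
RULE_SUFFIXES = {"ed", "ing"}


def ends_cvc(stem: str) -> bool:
    """True if the stem ends consonant-vowel-consonant, e.g. "hop"."""
    return (len(stem) >= 3
            and stem[-1] not in VOWELS
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS)


def attach(stem: str, suffix: str) -> str:
    """Build the word form that a stem and a suffix would produce."""
    if suffix in RULE_SUFFIXES:
        if stem.endswith("e"):
            # Rule 1 (assumed): drop a final "e", so hope + ing -> hoping.
            return stem[:-1] + suffix
        if ends_cvc(stem):
            # Rule 2 (assumed): double the final consonant, hop + ing -> hopping.
            return stem + stem[-1] + suffix
    return stem + suffix


def find_stem(word: str):
    """Return the first listed stem that yields the word, or None."""
    for stem in STEMS:
        for suffix in SUFFIXES:
            if attach(stem, suffix) == word:
                return stem
    return None


stems_found = {w: find_stem(w) for w in
               ["body", "bodies", "hope", "hoped", "hoping",
                "hop", "hopped", "hopping"]}
```

With these assumed lists, "hoped" is accepted only as "hope" plus "ed", because
"hop" plus "ed" would have to be spelled "hopped"; this is the sense in which a
spelling rule keeps the two groups apart.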