NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Classification Trees for Document Routing, A Report on the TREC Experiment

R. Tong, A. Winkler, P. Gage

National Institute of Standards and Technology, Donna K. Harman

available or when users' interests change.

While we would make no claim that CART alone is sufficient to guarantee high-performance detection and routing, we do believe that its ability to work automatically with any size data set and with any set of specified features means that it can be a very cost-effective component of such a system. Indeed, we believe that it is probably best used as an initial filter to screen out non-relevant documents, and that CART's output might then be fed to a more language-oriented algorithm to decrease the false alarm rate.

To summarize, we consider CART to be an important tool in our arsenal of effective and efficient document detection and routing technologies. While the results from TREC are preliminary, we believe that they do demonstrate that CART has a number of advantages over other approaches, namely:

* classification trees are constructed automatically from specifications of features and a set of examples,
* the learning algorithm generates an "optimal" classifier together with useful auxiliary data and statistics such as the misclassification probability,
* prior class probabilities can be used if known,
* specification of the misclassification cost function provides for direct control of the fallout and recall of the classifier, and
* the classification trees are easily understood and interpreted by end users.

In addition, the CART algorithm is completely language independent, in the sense that it makes no assumptions about the inherent features of the source language of the document; all that it requires are "features" and training examples. Further, the features themselves can be extracted from document externals as well as document internals.
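The claim that priors and the misclassification cost function give direct control over recall and fallout can be illustrated with a small sketch. It is not the authors' implementation: it assumes the standard CART leaf rule (class posterior proportional to prior times the within-class fraction of training documents reaching the leaf, and each leaf labeled with the minimum expected cost class). The leaf counts, the priors, and all function names (`leaf_posteriors`, `assign_label`, `label_leaves`, `recall_and_fallout`) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

RELEVANT, NONREL = "relevant", "nonrelevant"

# Hypothetical training counts per leaf of an already-grown tree (toy data).
LEAVES: List[Dict[str, int]] = [
    {RELEVANT: 8, NONREL: 2},
    {RELEVANT: 3, NONREL: 7},
    {RELEVANT: 1, NONREL: 19},
]
# Assumed prior class probabilities (here equal to the toy class frequencies).
PRIORS: Dict[str, float] = {RELEVANT: 0.3, NONREL: 0.7}


def leaf_posteriors(leaf: Dict[str, int], totals: Dict[str, int],
                    priors: Dict[str, float]) -> Dict[str, float]:
    """p(class | leaf), proportional to prior * (leaf count / class total)."""
    joint = {c: priors[c] * leaf[c] / totals[c] for c in priors}
    norm = sum(joint.values())
    return {c: v / norm for c, v in joint.items()}


def assign_label(post: Dict[str, float],
                 cost: Dict[Tuple[str, str], float]) -> str:
    """Pick the label with minimum expected misclassification cost.

    cost[(predicted, actual)] is the penalty for predicting `predicted`
    when the true class is `actual`; correct decisions cost nothing.
    """
    def expected(pred: str) -> float:
        return sum(cost.get((pred, actual), 0.0) * p
                   for actual, p in post.items())
    return min(post, key=expected)


def label_leaves(leaves: List[Dict[str, int]], priors: Dict[str, float],
                 miss_cost: float, false_alarm_cost: float = 1.0) -> List[str]:
    """Label every leaf under the given miss and false-alarm costs."""
    totals = {c: sum(leaf[c] for leaf in leaves) for c in priors}
    cost = {(NONREL, RELEVANT): miss_cost,
            (RELEVANT, NONREL): false_alarm_cost}
    return [assign_label(leaf_posteriors(leaf, totals, priors), cost)
            for leaf in leaves]


def recall_and_fallout(leaves: List[Dict[str, int]],
                       labels: List[str]) -> Tuple[float, float]:
    """Recall = retrieved relevant / all relevant;
    fallout = retrieved non-relevant / all non-relevant."""
    hits = sum(l[RELEVANT] for l, lab in zip(leaves, labels) if lab == RELEVANT)
    alarms = sum(l[NONREL] for l, lab in zip(leaves, labels) if lab == RELEVANT)
    total_rel = sum(l[RELEVANT] for l in leaves)
    total_non = sum(l[NONREL] for l in leaves)
    return hits / total_rel, alarms / total_non


if __name__ == "__main__":
    for miss_cost in (1.0, 5.0):
        labels = label_leaves(LEAVES, PRIORS, miss_cost)
        recall, fallout = recall_and_fallout(LEAVES, labels)
        print(f"miss_cost={miss_cost}: labels={labels} "
              f"recall={recall:.3f} fallout={fallout:.3f}")
```

With these toy numbers, raising the cost of a missed relevant document from 1 to 5 flips the second leaf to "relevant", so recall and fallout both rise. This is the trade-off the cost function is said to control, and it fits the suggested use of CART as a high-recall initial filter whose false alarms are reduced by a later stage.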
In TREC-2 we will explore some of the extensions discussed in Section 4 to show how CART can indeed be integrated into a system to help users who are faced with the task of searching through the "information wilderness."