NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Classification Trees for Document Routing, A Report on the TREC Experiment
R. Tong, A. Winkler, P. Gage
National Institute of Standards and Technology, Donna K. Harman

Finally, in Section 5 we offer some observations on the overall value of using CART as the basis of a document routing system.

2. The CART Algorithm

CART has been shown to be useful when one has access to datasets describing known classes of observations and wishes to obtain rules for classifying future observations of unknown class, exactly as in the document routing problem. CART is particularly attractive when the dataset is "messy" (i.e., is noisy and has many missing values) and thus unsuitable for use with more traditional classification techniques. In addition, and particularly important for the document routing application, when it is important to be able to specify both the misclassification costs and the prior probabilities of class membership, CART has a direct way of incorporating such information into the tree building process. Finally, CART can generate auxiliary information, such as the expected misclassification rate for the classifier as a whole and for each terminal node in the tree, that is useful for the document routing problem.

Figure 1 shows how the CART algorithm is used first to construct the optimal classification tree and then to generate a classification decision. The upper part of the diagram

[Figure 1: CART Processing Flow. Box labels for tree construction: Raw Training Data, Feature Extraction, Training Vectors, Tree Growing, Largest Tree (Tmax), Nested Sub-Trees (Tmax > T1 > T2 ... > Tn), Optimal Tree (T*); inputs: Class Specs., Class Priors, Feature Specs., Cost Function. Box labels for classification: Raw Test Data, Feature Extraction, Feature Vectors, Classification, Decision; inputs: Feature Specs., Optimal Tree.]
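
The way class priors and misclassification costs can enter a tree-based classifier, as described in Section 2, can be illustrated with a minimal sketch. It is not the authors' implementation; it follows the standard textbook formulation of CART, in which prior-adjusted class probabilities at a node determine the label with the lowest expected misclassification cost. The function names, the class labels "relevant" and "irrelevant", and all counts, priors, and costs below are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple


def node_posteriors(
    node_counts: Dict[str, int],
    class_totals: Dict[str, int],
    priors: Dict[str, float],
) -> Dict[str, float]:
    """Prior-adjusted class probabilities p(j|t) at a tree node t.

    The joint term p(j, t) is estimated as prior_j * N_j(t) / N_j, where
    N_j(t) is the number of class-j training cases in the node and N_j is
    the number of class-j cases in the whole training set.
    """
    joint = {
        c: priors[c] * node_counts.get(c, 0) / class_totals[c] for c in priors
    }
    total = sum(joint.values())
    if total == 0:
        raise ValueError("node contains no training cases")
    return {c: joint[c] / total for c in joint}


def assign_class(
    posteriors: Dict[str, float],
    cost: Dict[Tuple[str, str], float],
) -> Tuple[str, float]:
    """Pick the label with the lowest expected misclassification cost.

    cost[(predicted, actual)] is the cost of predicting `predicted` when the
    true class is `actual`; correct decisions are taken to cost nothing.
    Returns the chosen label and its expected cost at this node.
    """
    best_label, best_cost = None, float("inf")
    for predicted in posteriors:
        expected = sum(
            cost[(predicted, actual)] * p
            for actual, p in posteriors.items()
            if actual != predicted
        )
        if expected < best_cost:
            best_label, best_cost = predicted, expected
    return best_label, best_cost


# Hypothetical terminal node: 5 relevant and 15 irrelevant training documents,
# drawn from a training set of 50 relevant and 950 irrelevant documents.
counts = {"relevant": 5, "irrelevant": 15}
totals = {"relevant": 50, "irrelevant": 950}
priors = {"relevant": 0.05, "irrelevant": 0.95}

# Assumed costs: missing a relevant document is five times worse than
# routing an irrelevant one.
asymmetric = {("irrelevant", "relevant"): 5.0, ("relevant", "irrelevant"): 1.0}
unit = {("irrelevant", "relevant"): 1.0, ("relevant", "irrelevant"): 1.0}

post = node_posteriors(counts, totals, priors)
print(assign_class(post, asymmetric))
print(assign_class(post, unit))
```

With these illustrative numbers the node is labeled "relevant" under the asymmetric costs and "irrelevant" under unit costs, even though the training counts are identical. The expected cost returned for a node corresponds to the per-node misclassification estimate mentioned in the text.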