SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
chapter
R. Tong
A. Winkler
P. Gage
National Institute of Standards and Technology
Donna K. Harman
Finally, in Section 5, we offer some observations on the overall value of using CART
as the basis of a document routing system.
2. The CART Algorithm
CART has been shown to be useful when one has access to datasets describing
known classes of observations, and wishes to obtain rules for classifying future
observations of unknown class, exactly as in the document routing problem. CART is
particularly attractive when the dataset is "messy" (i.e., is noisy and has many missing values)
and thus unsuitable for use with more traditional classification techniques. In addition,
and of particular importance for the document routing application, CART provides a
direct way of incorporating both the misclassification costs and the prior probabilities
of class membership into the tree-building process. Finally, CART can generate auxiliary
information, such as the expected
misclassification rate for the classifier as a whole and for each terminal node in the tree,
that is useful for the document routing problem.
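As an illustration only, and not the implementation used in these experiments, the following Python sketch shows one common way that class priors and misclassification costs can enter the labelling of a terminal node and the expected cost of a whole tree. It follows the usual textbook CART formulation: within-node class probabilities are reweighted by the priors, each node takes the label with the lowest expected cost, and the tree-level figure is the sum of node costs weighted by node probability. The function names (node_summary, tree_cost, make_cost), the class labels, and all counts are hypothetical.

```python
def node_summary(node_counts, class_totals, priors, cost):
    """Return (label, p_node, node_cost) for one terminal node.

    node_counts  : training cases of each class that fall in this node
    class_totals : training cases of each class in the whole training set
    priors       : assumed prior probability of each class
    cost         : cost[(predicted, true)] of calling a `true` case `predicted`
    """
    # Prior-weighted joint probability of (class, node).
    joint = {c: priors[c] * node_counts.get(c, 0) / class_totals[c] for c in priors}
    p_node = sum(joint.values())  # assumes the node is not empty
    posterior = {c: joint[c] / p_node for c in joint}
    # Expected cost of each possible label; the node takes the cheapest one.
    expected = {
        pred: sum(cost[(pred, true)] * posterior[true] for true in posterior)
        for pred in posterior
    }
    label = min(expected, key=expected.get)
    return label, p_node, expected[label]


def tree_cost(leaves, class_totals, priors, cost):
    """Expected misclassification cost of the whole tree (sum over leaves)."""
    summaries = (node_summary(n, class_totals, priors, cost) for n in leaves)
    return sum(p_node * node_cost for _, p_node, node_cost in summaries)


def make_cost(miss, false_alarm):
    """Two-class cost table; correct decisions cost nothing."""
    return {
        ("relevant", "relevant"): 0.0,
        ("nonrelevant", "nonrelevant"): 0.0,
        ("nonrelevant", "relevant"): miss,
        ("relevant", "nonrelevant"): false_alarm,
    }


# Hypothetical training set: 100 relevant and 900 nonrelevant documents.
class_totals = {"relevant": 100, "nonrelevant": 900}
priors = {"relevant": 0.1, "nonrelevant": 0.9}
leaves = [
    {"relevant": 60, "nonrelevant": 20},
    {"relevant": 10, "nonrelevant": 40},
    {"relevant": 30, "nonrelevant": 840},
]

for name, cost in [("unit costs", make_cost(1.0, 1.0)),
                   ("misses cost 5x", make_cost(5.0, 1.0))]:
    labels = [node_summary(n, class_totals, priors, cost)[0] for n in leaves]
    print(name, labels, round(tree_cost(leaves, class_totals, priors, cost), 3))
```

In this made-up example the middle leaf is labelled nonrelevant under unit costs but flips to relevant once a missed relevant document is assumed to cost five times as much as a false alarm, which is the kind of trade-off a routing application may want to control.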
Figure 1 shows how the CART algorithm is used first to construct the optimal classifi-
cation tree and then to generate a classification decision. The upper part of the diagram
[Figure 1 diagram not reproduced. Recoverable labels, training path: Raw Training Data;
Feature Extraction (Class Specs., Feature Specs.); Training Vectors; Tree Growing
(Class Priors, Cost Function); Largest Tree (Tmax); Nested Sub-Trees
(Tmax > T1 > T2 ... > Tn); Optimal Tree (T*). Classification path: Raw Test Data;
Feature Extraction (Feature Specs.); Feature Vectors; Optimal Tree;
Classification Decision.]
Figure 1: CART Processing Flow