SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
New York University
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe
all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list yes
a. how many words in list? 380
2. is a controlled vocabulary used? no
3. stemming yes
a. standard stemming algorithms no
b. morphological analysis yes
4. term weighting yes
5. phrase discovery yes
a. what kind of phrase? NP's, VP's, others
b. using statistical methods partially
c. using syntactic methods yes
6. syntactic parsing yes
7. word sense disambiguation no
8. heuristic associations yes
a. short definition of these associations synonymy, specializations
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm partial
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description)
B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 290 MB (0.5 GByte text)
b. total computer time to build (approximate number of hours) 250
c. is the process completely automatic? yes
d. are term positions within documents stored? yes
e. single terms only? no
3. knowledge bases yes
a. total amount of storage (megabytes) 0.5
b. total number of concepts represented 3262
c. type of representation (frames, semantic nets, rules, etc.) associations
d. total computer time to build (approximate number of hours) 175
e. total manual time to build (approximate number of hours) 0
f. use of manual labor none
g. auxiliary files needed for machine use
(1) machine-readable dictionary (which one?) OALD