NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

System Summary and Timing
New York University

General Comments

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe your system needs for searching)

   A. Which of the following were used to build your data structures?

      1. stopword list: yes
         a. how many words in list? 380
      2. is a controlled vocabulary used? no
      3. stemming: yes
         a. standard stemming algorithms: no
         b. morphological analysis: yes
      4. term weighting: yes
      5. phrase discovery: yes
         a. what kind of phrase? NP's, VP's, others
         b. using statistical methods: partially
         c. using syntactic methods: yes
      6. syntactic parsing: yes
      7. word sense disambiguation: no
      8. heuristic associations: yes
         a. short definition of these associations: synonymy, specializations
      9. spelling checking (with manual correction): no
      10. spelling correction: no
      11. proper noun identification algorithm: partial
      12. tokenizer (recognizes dates, phone numbers, common patterns): no
      13. are the manually-indexed terms used? no
      14. other techniques used to build data structures (brief description): all data structures that

   B. Statistics on data structures built from TREC text (please fill out each applicable section)

      1. inverted index
         a. total amount of storage (megabytes): 29[illegible] MB (0.5 GByte txt)
         b. total computer time to build (approximate number of hours): 250
         c. is the process completely automatic? yes
         d. are term positions within documents stored? yes
         e. single terms only? no
      3. knowledge bases: yes
         a. total amount of storage (megabytes): 0.5
         b. total number of concepts represented: 3262
         c. type of representation (frames, semantic nets, rules, etc.): associations
         d. total computer time to build (approximate number of hours): 175
         e. total manual time to build (approximate number of hours): 0
         f. use of manual labor: none
         g. auxiliary files needed for machine use
            (1) machine-readable dictionary (which one?): OALD