APPENDIX C

This appendix contains the supplemental forms filled out by each group about their system. These forms are meant to supplement the papers and contain a standardized, uniformly formatted description of system features and timing aspects.

System Summary and Timing
City University, London

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
  1. stopword list  yes
     a. how many words in list?  126 general stop words + 6 function words, excluded from indexes and queries. Semi-stopword list of 256 words and phrases; these are not used in query expansion following relevance feedback unless they occur in the original query.
  2. is a controlled vocabulary used?  No, but see I.C.1.b.
  3. stemming  yes
     a. standard stemming algorithms  A moderately weak suffixing algorithm based on M. F. Porter, "An algorithm for suffix stripping," Program, 14(3), Jul 1980, 130-137. We also use a degree of British/American spelling conflation.
     b. morphological analysis  no
  4. term weighting  No. Query terms are weighted, but not index terms.
  5. phrase discovery  no
  6. syntactic parsing  no
  7. word sense disambiguation  no
  8. heuristic associations  no
  9. spelling checking (with manual correction)  no
  10. spelling correction  no
  11. proper noun identification algorithm  no
  12. tokenizer (recognizes dates, phone numbers, common patterns)  no
  13. are the manually-indexed terms used?  no
B. Statistics on data structures built from TREC text (please fill out each applicable section)
  1. inverted index
     a. total amount of storage (megabytes)  810
     b. total computer time to build (approximate number of hours)  43
     c. is the process completely automatic?  yes
     d. are term positions within documents stored?  No; insufficient disk space to do this.
     e. single terms only?  Single terms and pre-specified phrases (see I.C.1.b below)
C. Data built from sources other than the input text
  1. internally-built auxiliary files  One manually-built file.
     a. domain independent or domain specific (if two separate files, please fill out one set of questions for each file)  Loosely domain-dependent
     b. type of file (thesaurus, knowledge base, lexicon, etc.)  Small quasi-thesaurus containing synonym classes, prefixes, go phrases, stopwords, function words and semi-stopwords (see I.A.1.a for semi-stopwords).
     c. total amount of storage (megabytes)  0.013
     d. total number of concepts represented  About 1500
     e. type of representation (frames, semantic nets, rules, etc.)  Simple
     f. total computer time to build (approximate number of hours)  Manually built. Structured at runtime; time negligible.
     g. total manual time to build (approximate number of hours)  Perhaps 8 person-hours. Several iterations, based on frequency counts from indexing runs, other similar files, and TREC queries and documents.
     h. use of manual labor  (4) other (describe)  Manually built using a text editor
  2. externally-built auxiliary file  no lookup table
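For concreteness, the sketch below illustrates the kind of indexing-time preprocessing described in I.A: stop words are dropped, terms are stemmed with a Porter-style stemmer, and semi-stopwords are withheld from feedback expansion unless they appeared in the original query. It is only an illustration: NLTK's stock Porter stemmer stands in for City's weaker suffixing variant, and the word lists shown are tiny placeholders, not the actual 126-word, 6-word and 256-entry lists.

```python
# Illustrative preprocessing sketch (not the City University code).
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "of", "and", "a", "an", "in", "to", "for"}  # placeholder list
SEMI_STOP_WORDS = {"use", "provide", "include"}                  # placeholder list

stemmer = PorterStemmer()

def index_terms(text):
    """Terms kept for indexing: lowercased, stop words removed, stemmed."""
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

def expansion_candidates(terms, original_query_terms):
    """Semi-stopwords are excluded from feedback expansion unless they
    already occurred in the original query (see I.A.1.a)."""
    return [t for t in terms
            if t not in SEMI_STOP_WORDS or t in original_query_terms]

print(index_terms("An algorithm for suffix stripping of English words"))
```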
II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
  1. topic fields used  Concepts. Other fields were tried but gave overall (though not uniformly) worse results.
  2. total computer time to build query (cpu seconds)  0.02 seconds/query to parse the topic and extract concept terms
  3. which of the following were used?
     j. other (describe)  Concept terms processed and weighted. Term weight = constant * log( ((r+c)/(R-r+1-c)) / ((n-r+c)/(N-n-R+r+1-c)) ), where N is the number of indexed documents, n the number of documents containing the term, R the number of known relevant documents, r the number of known relevant documents containing the term, and c = 0.5. Weights are rounded to the nearest integer.
B. Manually constructed queries (ad hoc)
  1. topic fields used  Any: searchers' free choice.
  2. average time to build query (minutes)  About 40 minutes (often including trial searches)
  3. type of query builder  Six searchers were used. None was a domain expert. Two might be described as experts on the search system.
  4. tools used to build query
     c. other lexical tools (identify)  Trial lookups giving frequency. Trial searches.
  5. which of the following were used?
     a. term weighting  As in II.A.3.j above
     b. Boolean connectors (AND, OR, NOT)  All available. AND and OR were used in a number of searches.
     d. addition of terms not included in topic
        (1) source of terms  Searchers' world knowledge and terms from relevant documents found in trial searches.
C. Feedback (ad hoc)
  1. initial query built by method 1 or method 2?  Method 2
  2. type of person doing feedback  Searchers were Masters students in Information Science and two people working on the TREC project.
  3. average time to do complete feedback
     a. CPU time (total CPU seconds for all iterations)  About 20 seconds
     b. clock time from initial construction of query to completion of final query (minutes)  About 20 minutes
  4. average number of iterations  One
     a. average number of documents examined per iteration  About 20
  5. minimum number of iterations  One
  6. maximum number of iterations  One
  7. what determines the end of an iteration?  Searchers were recommended to stop after assessing 20 documents or when they had found 10 relevant documents. These guidelines were not always adhered to.
  8. feedback methods used
     b. automatic query expansion from relevant documents
        (2) only top X terms added (what is X)  The term pool was all query terms plus all non-semi-stop terms from relevant documents. The former were given an R-value of R + 3 and an r-value of r + 2. The top 20 terms were used, selected in descending order of (term weight * r) and weighted using the formula given previously. See II.A.3.j and I.A.1.a for "R", "r", and "semi-stop".
D. Automatically built queries (routing)
  1. topic fields used  Concepts
  2. total computer time to build query (cpu seconds)  Depended strongly on the number of known relevant documents in the training set and their length. Average perhaps 10 minutes.
  3. which of the following were used in building the query?
     a. terms selected from
        (1) topic
        (3) only documents with relevance judgments
     b. term weighting  As in II.A.3.j above, except R = R + 10 and r = r + 10 for concept terms.
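The weighting formula in II.A.3.j is the familiar Robertson/Sparck Jones relevance weight with c = 0.5. A minimal sketch of it in Python is shown below; the constant multiplier, the no-feedback defaults and the example numbers are illustrative assumptions, while the formula itself and the integer rounding follow the description above.

```python
import math

def relevance_weight(N, n, R=0, r=0, c=0.5, constant=1.0):
    """City-style query-term weight (II.A.3.j), rounded to the nearest integer.
    N: indexed documents, n: documents containing the term,
    R: known relevant documents, r: relevant documents containing the term."""
    numerator = (r + c) / (R - r + 1 - c)
    denominator = (n - r + c) / (N - n - R + r + 1 - c)
    return round(constant * math.log(numerator / denominator))

# Made-up example: 500,000 indexed documents, a term occurring in 12,000 of
# them, 15 known relevant documents of which 9 contain the term.
print(relevance_weight(N=500_000, n=12_000, R=15, r=9))   # -> 4
```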
III. Searching
A. Total computer time to search (cpu seconds)
  1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)  Typical figure for a 12-term search producing an output list of 350,000 document identifiers: 45 seconds (note that in an interactive production system we would use a weight threshold, which would reduce this by perhaps 50%).
  2. ranking time (total cpu seconds to sort document list)  For the list above: about 65 seconds (a weight threshold would reduce this by 50-90%).
B. Which methods best describe your machine searching methods?
  2. probabilistic model
C. What factors are included in your ranking?
  2. inverse document frequency  Inverse document frequency, and relevance information when available (see above for the weighting scheme).
IV. What machine did you conduct the TREC experiment on?  Sun SPARCserver 4/330 with a Sun IPC as fileserver
How much RAM did it have?  16 megabytes
What was the clock rate of the CPU?  Not specified; Sun claims 16 MIPS.
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  The system is non-commercial. It has undergone continual modification since 1982 to meet the requirements of a number of different research projects, mainly on end-user bibliographic searching.
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Faster hardware would of course increase speed. The main bottleneck is disk and network I/O. Very large amounts of RAM (of the order of a gigabyte per process, or shared between processes searching the same database) would greatly reduce the I/O dependence. On the software side, earlier versions of the system were often optimised for speed at some cost in added complexity and reduced flexibility. This optimisation has been removed from the version produced for TREC, partly because interactive searching by general users was not envisaged. It is impossible to give definite estimates of speed improvements, but it would not be unreasonable to expect an order of magnitude improvement within current hardware and software constraints.
  3. What features is your system missing that it would benefit by if it had them?  Given enough disk we would have stored positional information in the indexes, and probably used it to modify document weights, perhaps by giving weight bonuses for term proximity. This would have increased inversion storage overheads to a little over 100% of bibliographic file size. (This is not really a "missing feature," because the system does have the capability.) We might have considered some form of weight adjustment for document length. This would involve a modification of the index structure which might just have been feasible within the disk constraints. Other possibilities worth investigating include phrase discovery and term dependency statistics.
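The proximity bonus mentioned in V.3 was not implemented for TREC, but the idea is simple enough to sketch. The function below is a hypothetical illustration rather than City's design: it assumes positional postings are available (which the City system deliberately did not store) and adds a small bonus to a document's score whenever two query terms occur within a given window.

```python
# Hypothetical sketch of the term-proximity weight bonus discussed in V.3.
# Assumes term -> positions-in-document postings, which were not stored.

def proximity_bonus(positions_by_term, query_terms, window=5, bonus=1.0):
    """Add `bonus` for each pair of distinct query terms that co-occur
    within `window` word positions of each other."""
    total = 0.0
    present = [t for t in query_terms if t in positions_by_term]
    for i, t1 in enumerate(present):
        for t2 in present[i + 1:]:
            if any(abs(p1 - p2) <= window
                   for p1 in positions_by_term[t1]
                   for p2 in positions_by_term[t2]):
                total += bonus
    return total

# Made-up postings for one document.
doc_positions = {"suffix": [3, 40], "stripping": [4], "algorithm": [120]}
print(proximity_bonus(doc_positions, ["suffix", "stripping", "algorithm"]))
```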
System Summary and Timing
University of Pittsburgh

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
  1. stopword list
     a. how many words in list?  2,529 words on the list, including the digits 0-9.
  3. stemming
     a. standard stemming algorithms  We use the Porter stemming algorithm, in the C implementation by C. Fox.
  4. term weighting
  5. phrase discovery
  6. syntactic parsing
  7. word sense disambiguation
  8. heuristic associations
  9. spelling checking (with manual correction)
  10. spelling correction
  11. proper noun identification algorithm
  12. tokenizer (recognizes dates, phone numbers, common patterns)
  13. are the manually-indexed terms used?
  14. other techniques used to build data structures (brief description)
B. Statistics on data structures built from TREC text (please fill out each applicable section)
  1. inverted index
     a. total amount of storage (megabytes)  Only the data for disk one is available. The following table gives the figures in megabytes:

                          DOE     AP      ZIFF    WSJ
          Inverted files  162.3   199.8   143.7   223.4
          Indexed files   2.2     2.4     2.1
          Address files   4.3     1.7     1.7     2.6

        Note: data for FR is also not available (loaded onto tapes). Address files are index files which record document numbers and their offsets in the text files where the documents are stored.
     b. total computer time to build (approximate number of hours)  Please refer to Table 1 in our paper.
     c. is the process completely automatic?  yes
     d. are term positions within documents stored?  no
     e. single terms only?  yes
C. Data built from sources other than the input text  -- no
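The "address files" described in I.B.1.a amount to a map from document number to the offset of that document in the raw text file. A minimal sketch of building and using such a map is shown below; the TREC <DOC>/<DOCNO> markup is real, but the file name and the in-memory form of the table are assumptions, and real code would handle documents larger than one read block.

```python
# Sketch of an "address file": document number -> byte offset of the
# document in the raw TREC text file.
import re

def build_address_table(path):
    """Scan a TREC-format file and record where each <DOC> starts."""
    table, offset, doc_start = {}, 0, None
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(b"<DOC>"):
                doc_start = offset
            m = re.search(rb"<DOCNO>\s*(\S+)\s*</DOCNO>", line)
            if m and doc_start is not None:
                table[m.group(1).decode()] = doc_start
            offset += len(line)
    return table

def fetch_document(path, table, docno):
    """Seek straight to a stored document using the address table."""
    with open(path, "rb") as f:
        f.seek(table[docno])
        block = f.read(1 << 16)          # simplification: one 64 KB block
    return block.split(b"</DOC>")[0] + b"</DOC>"

# Usage (file name is hypothetical):
# table = build_address_table("wsj_disk1.txt")
# print(fetch_document("wsj_disk1.txt", table, "WSJ880815-0014"))
```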
II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
  1. topic fields used  Training queries: title and concepts are used; however, nationality might be included if it is necessary to meet the narrative item. Routing queries: the routing queries are the final converged queries from the training queries; there is no modification. Ad hoc queries: title and concepts are used, and some keywords from the narrative items are added.
  2. total computer time to build query (cpu seconds)  Computing time to build queries is not available.
  3. which of the following were used?
     a. term weighting with weights based on terms in topics  Term weighting for queries is assigned by the system; term-weight modification is our research topic. Note that the stemming algorithm used in document processing was also used on query terms. Training queries: all term weights were assigned automatically by the system and also adjusted by the system using feedback information. Routing queries: the term weights are those from the last generation of the training queries; no changes are applied. Ad hoc queries: for one query individual the term weights were assigned manually by the researchers; the other query individuals' term weights were generated by the system. (Note: our system uses 10 query individuals searching documents simultaneously.) The term weights were also adjusted using the feedback information.
C. Feedback (ad hoc)
  1. initial query built by method 1 or method 2?  Initial queries were built by method 1 (automatic).
  2. type of person doing feedback  Evaluation is done by our researchers.
  3. average time to do complete feedback  Please refer to Table 2 in our paper.
  4. average number of iterations  3 iterations on average
  5. minimum number of iterations  0
  6. maximum number of iterations  9
  7. what determines the end of an iteration?  No more relevant documents are retrieved, or it is not worthwhile to do more feedback because of the time constraint.
  8. feedback methods used  Query terms are automatically modified by the system using the genetic algorithm in our system.
III. Searching
A. Total computer time to search (cpu seconds)
  1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)  Please refer to Table 2 in our paper.
  2. ranking time (total cpu seconds to sort document list)  not available
B. Which methods best describe your machine searching methods?
  1. vector space model  A distance function (Lp metric) is used as the similarity measurement.
C. What factors are included in your ranking?
  15. other (specify)  Document ranking is based on the distance: the shorter the distance, the higher the rank. That is, the document with the shortest distance is put at the top of the list.
IV. What machine did you conduct the TREC experiment on? How much RAM did it have? What was the clock rate of the CPU?  Two types of systems were used. Sun-670: 32 MB RAM and 40 MHz CPU clock rate; Sun SPARC/IPC: 24 MB RAM and 25 MHz clock rate.
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  If our system could be implemented on a parallel machine, retrieval could be 10 times faster.
  3. What features is your system missing that it would benefit by if it had them?  There are a lot of parameters which can be adjusted to make our system more flexible and more adaptive. We need to build a good user interface on which several parameters can be controlled and manipulated by the users.
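Section III.B above describes ranking by an Lp-metric distance between query and document vectors rather than by an inner-product similarity, with the shortest distance ranked first. The sketch below illustrates that style of ranking; the choice of p, the sparse-vector representation and the toy weights are all made up for illustration.

```python
# Distance-based ranking with an Lp metric (see III.B/III.C):
# smaller distance = higher rank.

def lp_distance(query, doc, p=2):
    """Lp distance over the union of two sparse term-weight vectors."""
    terms = set(query) | set(doc)
    return sum(abs(query.get(t, 0.0) - doc.get(t, 0.0)) ** p
               for t in terms) ** (1.0 / p)

def rank(query, docs, p=2):
    """Return (doc_id, distance) pairs, shortest distance first."""
    scored = [(doc_id, lp_distance(query, vec, p)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1])

query = {"bank": 0.8, "regulation": 0.6}
docs = {"d1": {"bank": 0.7, "regulation": 0.5, "loan": 0.2},
        "d2": {"music": 0.9, "bank": 0.1}}
print(rank(query, docs))     # d1 is ranked above d2
```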
System Summary and Timing
Cornell University
Run 1: Single term automatic ad hoc run (global/local match)

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
  1. stopword list
     a. how many words in list?  570
  2. is a controlled vocabulary used?  no
  3. stemming  yes
     a. standard stemming algorithms  which ones?  SMART
  4. term weighting  In docs, tf * idf, cosine normalization (ntc). In queries, tf * idf, cosine normalization (ntc). In sentences, tf * idf, no normalization (ntn).
  5. phrase discovery
  6. syntactic parsing
  7. word sense disambiguation
  8. heuristic associations
  9. spelling checking (with manual correction)
  10. spelling correction
  11. proper noun identification algorithm
  12. tokenizer (recognizes dates, phone numbers, common patterns)
  13. are the manually-indexed terms used?
  14. other techniques used to build data structures (brief description)
B. Statistics on data structures built from TREC text (please fill out each applicable section)
  1. inverted index
     a. total amount of storage (megabytes)  690
     b. total computer time to build (approximate number of hours)  4.7 hours to create doc vectors from text; 0.7 hours to reweight doc vectors and produce the inverted file
     c. is the process completely automatic?  yes
     d. are term positions within documents stored?  no
     e. single terms only?  yes
  5. other data structures built from TREC text (what?)  Map from doc id to text location (also gives the title for each doc)
     a. total amount of storage (megabytes)  68 Mbytes
     b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
     c. is the process completely automatic?  Automatic
     d. brief description of methods used
     other data structures built from TREC text (what?)  Map from internal concept to token string
     a. total amount of storage (megabytes)  18 Mbytes
     b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
     c. is the process completely automatic?  Automatic
     d. brief description of methods used
C. Data built from sources other than the input text  None, other than the stopword file.
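The "ntc" weighting that the Cornell runs report is SMART shorthand for tf * idf weights with cosine normalization. The code below is a generic illustration of that weighting and of a cosine match between two such vectors, not the SMART implementation itself; the toy collection statistics are made up.

```python
import math
from collections import Counter

def ntc_vector(term_counts, df, N):
    """tf * idf weights, cosine-normalized (SMART 'ntc').
    term_counts: raw tf per term; df: document frequencies; N: collection size."""
    weights = {t: tf * math.log(N / df[t])
               for t, tf in term_counts.items() if df.get(t)}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}

def cosine(q, d):
    """Inner product of two already-normalized sparse vectors."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

N = 1000                                    # made-up collection statistics
df = {"tariff": 50, "trade": 200, "music": 400}
doc = ntc_vector(Counter({"tariff": 3, "trade": 1}), df, N)
query = ntc_vector(Counter({"tariff": 1, "trade": 1, "music": 1}), df, N)
print(round(cosine(query, doc), 3))
```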
II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
  1. topic fields used  Topic, Nationality, Narrative, Concepts, Factors, Description
  2. total computer time to build query (cpu seconds)  1.5 seconds for all queries
  3. which of the following were used?
     a. term weighting with weights based on terms in topics (idf)
III. Searching
A. Total computer time to search (cpu seconds)  1465 seconds (includes retrieval + ranking + indexing 500 docs per query)
  1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)
  2. ranking time (total cpu seconds to sort document list)
B. Which methods best describe your machine searching methods?
  1. vector space model
C. What factors are included in your ranking?
  1. term frequency
  2. inverse document frequency
  7. proximity of terms  within sentence; needed for local similarity
  8. information theoretic weights
  9. document length
IV. What machine did you conduct the TREC experiment on?  Sun SPARC 2
How much RAM did it have?  64 MB
What was the clock rate of the CPU?  40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  About 3 person-years for the SMART system itself
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Of course! Retrieval local similarity needed to index 500 docs per query; this could all be done in advance if a single local approach had been decided on, reducing retrieval time by a factor of 5. A 6-machine distributed version of SMART should be faster by a factor of 3 for both indexing and retrieval.
  3. What features is your system missing that it would benefit by if it had them?  The distributed version has not been fully implemented yet.

System Summary and Timing
Cornell University
Run 2: Phrase automatic ad hoc (Cornell Global/Local)

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
  1. stopword list
     a. how many words in list?  570
  2. is a controlled vocabulary used?  Not for single terms. A phrase list was automatically constructed from phrases occurring 25 times or more in the first doc set (D1). Only those phrases were used.
  3. stemming  yes
     a. standard stemming algorithms  which ones?  SMART
  4. term weighting  In docs, tf * idf, cosine normalization over the length of single terms (ntc). In queries, tf * idf, cosine normalization over the length of single terms (ntc). In sentences, tf * idf, no normalization (ntn). Phrases weighted using their natural tf * idf, cosine normalized by the length of single terms, and divided by sqrt(2). [A phrase match is worth 0.5 of a single-term match.]
  5. phrase discovery
     a. what kind of phrase?  Adjacent non-stopwords, components stemmed, that occurred at least 25 times in the D1 document set.
  6. syntactic parsing
  7. word sense disambiguation
  8. heuristic associations
  9. spelling checking (with manual correction)
  10. spelling correction
  11. proper noun identification algorithm
  12. tokenizer (recognizes dates, phone numbers, common patterns)
  13. are the manually-indexed terms used?
  14. other techniques used to build data structures (brief description)
B. Statistics on data structures built from TREC text (please fill out each applicable section)
  1. inverted index
     a. total amount of storage (megabytes)  840
     b. total computer time to build (approximate number of hours)  9.7 hours to create doc vectors from text; 0.9 hours to reweight doc vectors and produce the inverted file
     c. is the process completely automatic?  yes
     d. are term positions within documents stored?  no
     e. single terms only?  no
  5. other data structures built from TREC text (what?)  Map from doc id to text location (also gives the title for each doc)
     a. total amount of storage (megabytes)  68 Mbytes
     b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
     c. is the process completely automatic?  Automatic
     other data structures built from TREC text (what?)  Map from internal concept to token string
     a. total amount of storage (megabytes)  25 Mbytes
     b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
     c. is the process completely automatic?  Automatic
     other data structures built from TREC text (what?)  Phrase dictionary (controlled vocabulary). Phrases were adjacent non-stopwords, components stemmed, that occurred at least 25 times in the D1 document set.
     a. total amount of storage (megabytes)  14 Mbytes to store the dictionary
     b. total computer time to build (approximate number of hours)  It took 5.8 CPU hours to index D1, finding phrases and their collection statistics. Of those phrases, 158,000 occurred at least 25 times.
     c. is the process completely automatic?
C. Data built from sources other than the input text  None, other than the stopword file.
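The phrase dictionary described in I.A.5 and I.B is easy to approximate: count adjacent stemmed non-stopword pairs over the D1 collection and keep those occurring at least 25 times. The sketch below shows that procedure in outline; NLTK's Porter stemmer and the tiny stopword set are stand-ins for SMART's, and the exact handling of stopword gaps is an assumption.

```python
# Sketch of the phrase dictionary: adjacent non-stopwords, components
# stemmed, kept if they occur at least 25 times in the D1 collection.
from collections import Counter
from nltk.stem import PorterStemmer

STOP = {"the", "of", "and", "a", "in", "to", "for"}      # placeholder list
stem = PorterStemmer().stem

def phrase_candidates(text):
    """Yield stemmed pairs of adjacent non-stopwords from one document."""
    stems = [stem(w) for w in text.lower().split() if w not in STOP]
    return zip(stems, stems[1:])

def build_phrase_dictionary(documents, min_count=25):
    counts = Counter()
    for doc in documents:
        counts.update(phrase_candidates(doc))
    return {pair: c for pair, c in counts.items() if c >= min_count}

# Usage over the training collection:
# phrases = build_phrase_dictionary(d1_documents)   # iterable of document strings
```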
II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
  1. topic fields used  Topic, Nationality, Narrative, Concepts, Factors, Description
  2. total computer time to build query (cpu seconds)  2.7 seconds for all 50 queries
  3. which of the following were used?
     a. term weighting with weights based on terms in topics (idf)
     b. phrase extraction from topics  yes, using the controlled list of phrases
III. Searching
A. Total computer time to search (cpu seconds)  2405 seconds (includes retrieval + ranking + indexing 500 docs/query)
  1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)
  2. ranking time (total cpu seconds to sort document list)
B. Which methods best describe your machine searching methods?
  1. vector space model
C. What factors are included in your ranking?
  1. term frequency
  2. inverse document frequency
  7. proximity of terms  for phrases and for local similarity between sentences
  9. document length
IV. What machine did you conduct the TREC experiment on?  Sun SPARC 2
How much RAM did it have?  64 MB
What was the clock rate of the CPU?  40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  About 3 person-years for the SMART system itself
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Of course!
  3. What features is your system missing that it would benefit by if it had them?

System Summary and Timing
Cornell University
Run 3: Automatic routing (Cornell Ide feedback)

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
  1. stopword list
     a. how many words in list?  570
  2. is a controlled vocabulary used?  no
  3. stemming  yes
     a. standard stemming algorithms  which ones?  SMART
     b. morphological analysis
  4. term weighting  In docs + queries, tf * idf, cosine normalization (ntc) (in docs, idf is based on collection frequency within doc set D1 only)
  5. phrase discovery
  6. syntactic parsing
  7. word sense disambiguation
  8. heuristic associations
  9. spelling checking (with manual correction)
  10. spelling correction
  11. proper noun identification algorithm
  12. tokenizer (recognizes dates, phone numbers, common patterns)
  13. are the manually-indexed terms used?
  14. other techniques used to build data structures (brief description)
B. Statistics on data structures built from TREC text (please fill out each applicable section)
  1. inverted index
     a. total amount of storage (megabytes)  275
     b. total computer time to build (approximate number of hours)  1.9 hours (not including the time to index D1 to obtain collection frequency information)
     c. is the process completely automatic?  yes
     d. are term positions within documents stored?  no
     e. single terms only?  yes
  5. other data structures built from TREC text (what?)  Map from doc id to text location (also gives the title for each doc)
     a. total amount of storage (megabytes)  24 Mbytes
     b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
     c. is the process completely automatic?  Automatic
     other data structures built from TREC text (what?)  Map from internal concept to token string
     a. total amount of storage (megabytes)  13 Mbytes
     b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation for D1
     c. is the process completely automatic?  Automatic
C. Data built from sources other than the input text  None, other than the stopword file.
II. Query construction (please fill out a section for each query construction method used)
D. Automatically built queries (routing)
  1. topic fields used  Topic, Nationality, Narrative, Concepts, Factors, Description
  2. total computer time to build query (cpu seconds)  300
  3. which of the following were used in building the query?
     a. terms selected from
        (1) topic
        (3) only documents with relevance judgments
     b. term weighting
        (1) with weights based on terms in topics
        (2) with weights based on terms in all training documents
        (3) with weights based on terms from documents with relevance judgments
     h. expansion of queries using previously-constructed data structure (from part I)
        (1) which structure?  30 best terms from relevant docs
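Run 3 is labelled "Ide feedback", and II.D.3 says the routing queries were expanded with the 30 best terms from the relevant documents. The sketch below shows a generic Ide-style expansion (add the sum of the relevant-document vectors to the topic vector, then keep the best new terms); the exact SMART weighting is not spelled out on this form, so the formula and the cutoff handling here are assumptions.

```python
# Generic Ide-style routing query: original query vector plus the sum of the
# relevant training-document vectors, expanded by the 30 best new terms.
from collections import defaultdict

def ide_routing_query(query_vec, relevant_doc_vecs, n_expansion_terms=30):
    expanded = defaultdict(float, query_vec)
    for doc in relevant_doc_vecs:
        for term, weight in doc.items():
            expanded[term] += weight
    # Keep every original query term plus the best new terms.
    new_terms = [t for t in expanded if t not in query_vec]
    best_new = sorted(new_terms, key=lambda t: expanded[t],
                      reverse=True)[:n_expansion_terms]
    keep = set(query_vec) | set(best_new)
    return {t: expanded[t] for t in keep}

q = {"oil": 1.0, "spill": 1.0}
rels = [{"oil": 0.4, "tanker": 0.9, "exxon": 0.7},
        {"spill": 0.3, "tanker": 0.5}]
print(ide_routing_query(q, rels, n_expansion_terms=2))
```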
III. Searching
A. Total computer time to search (cpu seconds)  293 seconds (includes retrieval + ranking)
  1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)
  2. ranking time (total cpu seconds to sort document list)
B. Which methods best describe your machine searching methods?
  1. vector space model
C. What factors are included in your ranking?
  1. term frequency
  2. inverse document frequency
  9. document length
IV. What machine did you conduct the TREC experiment on?  Sun SPARC 2
How much RAM did it have?  64 MB
What was the clock rate of the CPU?  40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  About 3 person-years for the SMART system itself
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Of course!
  3. What features is your system missing that it would benefit by if it had them?

System Summary and Timing
University of California, Berkeley

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
  1. stopword list  Yes, augmented SMART stoplist
     a. how many words in list?  About 600
  2. is a controlled vocabulary used?  no
  3. stemming
     a. standard stemming algorithms  yes; which ones?  SMART system (Version 10) stemmer
     b. morphological analysis  none
  4. term weighting  yes. Weights determined from various frequency statistics by logistic regression.
  5. phrase discovery  none
  6. syntactic parsing  none
  7. word sense disambiguation  none
  8. heuristic associations  none
  9. spelling checking (with manual correction)  none
  10. spelling correction  none
  11. proper noun identification algorithm  none
  12. tokenizer (recognizes dates, phone numbers, common patterns)  none
  13. are the manually-indexed terms used?  no
  14. other techniques used to build data structures (brief description)
B. Statistics on data structures built from TREC text (please fill out each applicable section)
  1. inverted index
     a. total amount of storage (megabytes)  Ranges from 70 to 180 MB for each of the five collections
     b. total computer time to build (approximate number of hours)  Ranges from 6 to 14 hours for each of the five collections
     c. is the process completely automatic?  yes
     d. are term positions within documents stored?  no
     e. single terms only?  yes
C. Data built from sources other than the input text  -- no
II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
  1. topic fields used  all
  2. total computer time to build query (cpu seconds)  around 3 seconds total per query
  3. which of the following were used?
     a. term weighting with weights based on terms in topics
     j. other (describe)  Absolute and relative frequency of each stem in the query were used to weight the stems, using a formula obtained by logistic regression from the WSJ relevance data.
III. Searching
A. Total computer time to search (cpu seconds)
  1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)
  2. ranking time (total cpu seconds to sort document list)
B. Which methods best describe your machine searching methods?
  2. probabilistic model  Yes, probabilistic searching based on the linked dependence assumption and two stages of logistic regression, as described in Proceedings ACM/SIGIR, Copenhagen, June 1992.
C. What factors are included in your ranking?
  1. term frequency
  2. inverse document frequency
  3. other term weights (where do they come from?)  see 15 below
  5. position in document  stem occurrence frequencies in titles were doubled in some collections
  9. document length
  15. other (specify)  Variables used were: absolute and relative frequency of the stem in the query; absolute and relative frequency of the stem in the document; inverse document frequency of the stem in the collection; global relative frequency of the stem in all document texts; document length measured in stem occurrences.
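Berkeley's ranking combines the variables listed in III.C.15 through logistic regression fitted to relevance data. The sketch below shows only the general shape of such a scoring function: the coefficient values are invented, and the real system's two regression stages and linked dependence assumption are not reproduced here.

```python
import math

# Invented coefficients; the real ones come from logistic regression fitted
# to relevance judgments, and the real system uses two regression stages.
COEF = {"intercept": -3.5,
        "abs_freq_query": 0.2, "rel_freq_query": 1.1,
        "abs_freq_doc": 0.1,   "rel_freq_doc": 2.0,
        "idf": 0.4,            "global_rel_freq": -0.8,
        "doc_length": -0.0005}

def term_log_odds(features):
    """Linear combination of the per-term variables listed in III.C.15."""
    return COEF["intercept"] + sum(COEF[name] * value
                                   for name, value in features.items())

def document_score(per_term_features):
    """Sum per-term log-odds over matching query terms and map to (0, 1);
    a simplification of the staged model."""
    log_odds = sum(term_log_odds(f) for f in per_term_features)
    return 1.0 / (1.0 + math.exp(-log_odds))

features = [{"abs_freq_query": 1, "rel_freq_query": 0.1,
             "abs_freq_doc": 3, "rel_freq_doc": 0.01,
             "idf": 5.2, "global_rel_freq": 0.0001, "doc_length": 350}]
print(round(document_score(features), 4))
```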
IV. What machine did you conduct the TREC experiment on?  Three different machines: 1. DECstation 5000/125 with 16 megabytes RAM for most work; 2. DECstation 5000/125 with 64 megabytes RAM for a little; 3. IBM Model 3090 for the logistic regression analysis.
How much RAM did it have? What was the clock rate of the CPU?  25 MHz for the 16-megabyte DECstation (this was used for the timed retrieval runs); 40 MHz for the 64-megabyte DECstation.
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  None, except for the novel two-stage probabilistic logic. The Berkeley system is an experimental prototype only, programmed as a minimal modification of the SMART system.
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Yes, see the discussion in SMART's documentation: SMART is "not strongly optimized for any one particular use." The Berkeley system has roughly the same efficiency characteristics as SMART.
  3. What features is your system missing that it would benefit by if it had them?  It would probably benefit from a conflator, a thesaurus, a disambiguator, phrase discovery, stem proximity detection, etc. The Berkeley system is a bare-bones design, intended only to explore the workability of staged logistic regression.

System Summary and Timing
Universitaet Dortmund
Single term automatic ad hoc run (Fuhr learning)

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
  1. stopword list
     a. how many words in list?  570
  2. is a controlled vocabulary used?  no
  3. stemming  yes
     a. standard stemming algorithms  which ones?  SMART
     b. morphological analysis
  4. term weighting  In docs, a linear combination of several factors. In queries, tf * idf, cosine normalization (ntc).
  5. phrase discovery  no
  6. syntactic parsing  no
  7. word sense disambiguation  no
  8. heuristic associations  no
  9. spelling checking (with manual correction)  no
  10. spelling correction  no
  11. proper noun identification algorithm  no
  12. tokenizer (recognizes dates, phone numbers, common patterns)  no
  13. are the manually-indexed terms used?  no
  14. other techniques used to build data structures (brief description)  Coefficients for the linear combinations used in weighting were determined automatically using Q1, D1 and the judgments of Q1 on D1. This took 1.7 hours (not including 2.6 hours to index Q1, D1).
B. Statistics on data structures built from TREC text (please fill out each applicable section)
  1. inverted index
     a. total amount of storage (megabytes)  690
     b. total computer time to build (approximate number of hours)  4.7 hours to create doc vectors from text; 1.7 hours to reweight doc vectors and produce the inverted file
     c. is the process completely automatic?  yes
     d. are term positions within documents stored?  no
     e. single terms only?  yes
  5. other data structures built from TREC text (what?)  Map from doc id to text location (also gives the title for each doc)
     a. total amount of storage (megabytes)  68 Mbytes
     b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
     c. is the process completely automatic?  yes
     other data structures built from TREC text (what?)  Map from internal concept to token string
     a. total amount of storage (megabytes)  18 Mbytes
     b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
     c. is the process completely automatic?  yes
C. Data built from sources other than the input text  None, other than the stopword file.
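The Dortmund runs weight document terms with a linear combination of indexing factors whose coefficients are learned from the Q1/D1 relevance judgments (items I.A.4 and I.A.14). The sketch below fits such coefficients by ordinary least squares; the particular factors and the estimation method are assumptions for illustration, since the form does not name them.

```python
# Sketch of learning coefficients for a linear document-term weighting
# function from training judgments (I.A.14).  Features and least-squares
# estimation are illustrative assumptions.
import numpy as np

# One row per (query term, document) pair from the training data,
# e.g. [tf, idf, normalized tf, log document length].
X = np.array([[3, 4.1, 0.30, 5.8],
              [1, 2.2, 0.05, 6.9],
              [5, 4.1, 0.55, 5.1],
              [0, 2.2, 0.00, 6.2]], dtype=float)
# Target: 1.0 if the pair came from a relevant (query, document) pair.
y = np.array([1.0, 0.0, 1.0, 0.0])

design = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)

def term_weight(factors):
    """Learned linear combination, applied at indexing time."""
    return float(coeffs[0] + np.dot(coeffs[1:], factors))

print(term_weight([2, 4.1, 0.2, 5.5]))
```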
II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
  1. topic fields used  Topic, Nationality, Narrative, Concepts, Factors, Description
  2. total computer time to build query (cpu seconds)  1.5 seconds
  3. which of the following were used?
     a. term weighting with weights based on terms in topics (idf)
III. Searching
A. Total computer time to search (cpu seconds)  383 seconds (includes retrieval + ranking)
  1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)
  2. ranking time (total cpu seconds to sort document list)
B. Which methods best describe your machine searching methods?
  1. vector space model
  2. probabilistic model
C. What factors are included in your ranking?
  1. term frequency
  2. inverse document frequency
  8. information theoretic weights
  9. document length
IV. What machine did you conduct the TREC experiment on?  Sun SPARC 2
How much RAM did it have?  64 MB
What was the clock rate of the CPU?  40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  About 3 person-years for the SMART system itself; 2 person-weeks for the Fuhr weighting code
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Of course! A 6-machine distributed version of SMART should be faster by a factor of 3 for both indexing and retrieval.
  3. What features is your system missing that it would benefit by if it had them?  The distributed version has not been fully implemented yet.

System Summary and Timing
Universitaet Dortmund
Phrase automatic ad hoc (Fuhr learning)

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
  1. stopword list
     a. how many words in list?  570
  2. is a controlled vocabulary used?  Not for single terms. A phrase list was automatically constructed from phrases occurring 25 times or more in the first doc set (D1). Only those phrases were used.
  3. stemming  yes
     a. standard stemming algorithms  which ones?  SMART
     b. morphological analysis
  4. term weighting  In docs, a linear combination of several factors. In queries, tf * idf, cosine normalization (ntc).
  5. phrase discovery
     a. what kind of phrase?  Adjacent non-stopwords, components stemmed, that occurred at least 25 times in the D1 document set.
     b. using statistical methods
     c. using syntactic methods
  6. syntactic parsing  no
  7. word sense disambiguation  no
  8. heuristic associations  no
  9. spelling checking (with manual correction)  no
  10. spelling correction  no
  11. proper noun identification algorithm  no
  12. tokenizer (recognizes dates, phone numbers, common patterns)  no
  13. are the manually-indexed terms used?  no
  14. other techniques used to build data structures (brief description)  Coefficients for the linear combinations used in weighting were determined automatically using Q1, D1 and the judgments of Q1 on D1. This took 2.4 hours (not including 5.6 hours to index Q1, D1).
B. Statistics on data structures built from TREC text (please fill out each applicable section)
  1. inverted index
     a. total amount of storage (megabytes)  840
     b. total computer time to build (approximate number of hours)  9.7 hours to create doc vectors from text; 2.9 hours to reweight doc vectors and produce the inverted file
     c. is the process completely automatic?  yes
     d. are term positions within documents stored?  no
     e. single terms only?  no
  5. other data structures built from TREC text (what?)  Map from doc id to text location (also gives the title for each doc)
     a. total amount of storage (megabytes)  68 Mbytes
     b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
     c. is the process completely automatic?  yes
     other data structures built from TREC text (what?)  Map from internal concept to token string
     a. total amount of storage (megabytes)  25 Mbytes
     b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
     c. is the process completely automatic?  yes
     other data structures built from TREC text (what?)  Phrase dictionary (controlled vocabulary). Phrases were adjacent non-stopwords, components stemmed, that occurred at least 25 times in the D1 document set.
     a. total amount of storage (megabytes)  14 Mbytes to store the dictionary
     b. total computer time to build (approximate number of hours)  It took 5.8 hours to index D1, finding phrases and their collection statistics. Of those phrases, 158,000 occurred at least 25 times.
     c. is the process completely automatic?
C. Data built from sources other than the input text  None, other than the stopword file.
II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
  1. topic fields used  Topic, Nationality, Narrative, Concepts, Factors, Description
  2. total computer time to build query (cpu seconds)  2.7 seconds
  3. which of the following were used?
     a. term weighting with weights based on terms in topics (idf)
     b. phrase extraction from topics  yes, using the controlled list of phrases
III. Searching
A. Total computer time to search (cpu seconds)  374 seconds (includes retrieval + ranking)
  1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)
  2. ranking time (total cpu seconds to sort document list)
B. Which methods best describe your machine searching methods?
  1. vector space model
  2. probabilistic model
C. What factors are included in your ranking?
  1. term frequency
  2. inverse document frequency
  7. proximity of terms (for phrases)
  8. information theoretic weights
  9. document length
IV. What machine did you conduct the TREC experiment on?  Sun SPARC 2
How much RAM did it have?  64 MB
What was the clock rate of the CPU?  40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  About 3 person-years for the SMART system itself; 2 person-weeks for the Fuhr weighting code
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Of course! Phrase indexing could be sped up by 40% by an algorithm change (the speed-up has been done for single terms, but not for phrases).
  3. What features is your system missing that it would benefit by if it had them?
System Summary and Timing
Universitaet Dortmund
Automatic routing (RPI feedback)

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
  1. stopword list
     a. how many words in list?  570
  2. is a controlled vocabulary used?  no
  3. stemming  yes
     a. standard stemming algorithms  which ones?  SMART
     b. morphological analysis
  4. term weighting  In docs + queries, tf * idf, cosine normalization (ntc) (in docs, idf is based on collection frequency within doc set D1 only)
  5. phrase discovery  no
  6. syntactic parsing  no
  7. word sense disambiguation  no
  8. heuristic associations  no
  9. spelling checking (with manual correction)  no
  10. spelling correction  no
  11. proper noun identification algorithm  no
  12. tokenizer (recognizes dates, phone numbers, common patterns)  no
  13. are the manually-indexed terms used?  no
  14. other techniques used to build data structures (brief description)  no
B. Statistics on data structures built from TREC text (please fill out each applicable section)
  1. inverted index
     a. total amount of storage (megabytes)  275
     b. total computer time to build (approximate number of hours)  1.9 hours (not including the time to index D1 to obtain collection frequency information)
     c. is the process completely automatic?  yes
     d. are term positions within documents stored?  no
     e. single terms only?  yes
  5. other data structures built from TREC text (what?)  Map from doc id to text location (also gives the title for each doc)
     a. total amount of storage (megabytes)  24 Mbytes
     b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
     c. is the process completely automatic?  yes
     other data structures built from TREC text (what?)  Map from internal concept to token string
     a. total amount of storage (megabytes)  13 Mbytes
     b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation of D1.
     c. is the process completely automatic?  yes
C. Data built from sources other than the input text  None, other than the stopword file.
II. Query construction (please fill out a section for each query construction method used)
D. Automatically built queries (routing)
  1. topic fields used  all
  2. total computer time to build query (cpu seconds)  1300 seconds, not including the time to index D1 (3.0 hours)
  3. which of the following were used in building the query?
     a. terms selected from
        (1) topic
     b. term weighting
        (1) with weights based on terms in topics
        (2) with weights based on terms in all training documents
        (3) with weights based on terms from documents with relevance judgments
III. Searching
A. Total computer time to search (cpu seconds)  312 seconds (includes retrieval + ranking)
  1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)
  2. ranking time (total cpu seconds to sort document list)
B. Which methods best describe your machine searching methods?
  1. vector space model
  2. probabilistic model
C. What factors are included in your ranking?
  1. term frequency
  2. inverse document frequency
  8. information theoretic weights
  9. document length
IV. What machine did you conduct the TREC experiment on?  Sun SPARC 2
How much RAM did it have?  64 MB
What was the clock rate of the CPU?  40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  About 3 person-years for the SMART system itself
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Of course! Due to an algorithm flaw, the CPU time for constructing a routing query is about a factor of 5 too high (the algorithm found the best terms to expand by even though we had requested expansion by 0 terms).
  3. What features is your system missing that it would benefit by if it had them?

System Summary and Timing
University of Illinois at Chicago

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
Each document is represented as a set of word pairs. Pairs were formed from all adjacent words, plus all words separated by one and two intermediate words. Documents were the unit of organization for the data structure. If a pair occurred only once in a document, it was dropped from the data structure for that document only. A sample record is as follows:

   MULTIMEDIA ENCYCLOPEDIA 2 WSJ880815-0014

The number of times the pair occurred in the document appears in the third field, just before the document id.
A. Which of the following were used to build your data structures?
  1. stopword list  The stopword list from SMART version 10 was used, plus some additional stop words from TREC markup codes.
     a. how many words in list?  The total size of the stoplist was 631 words.
  2. is a controlled vocabulary used?  none
  3. stemming  none
     a. standard stemming algorithms  which ones?  Some small stemming experiments were later performed using the code from SMART version 10 and three training queries. For query 002 stemming had no effect, while for query 006 it resulted in a 43% increase in recall, and for query 009 a 73% improvement in recall.
     b. morphological analysis  none
  4. term weighting  None. Weighting was planned but could not be implemented given limitations that arose.
  5. phrase discovery
     a. what kind of phrase?  Word pairs occurring within three word positions of one another.
     b. using statistical methods  All such pairs were identified.
     c. using syntactic methods
  6. syntactic parsing  none
  7. word sense disambiguation  none
  8. heuristic associations
     a. short definition of these associations  Only the basic pairing associations were used.
  9. spelling checking (with manual correction)  none
  10. spelling correction  none
  11. proper noun identification algorithm  none
  12. tokenizer (recognizes dates, phone numbers, common patterns)  none
  13. are the manually-indexed terms used?  none
  14. other techniques used to build data structures (brief description)  none
B. Statistics on data structures built from TREC text (please fill out each applicable section)
  1. inverted index  Based only on pairs, not individual terms.
     a. total amount of storage (megabytes)  819 megabytes
     b. total computer time to build (approximate number of hours)  100 hours
     c. is the process completely automatic?  yes
     d. are term positions within documents stored?  no
     e. single terms only?  none
  2. n-grams, suffix arrays, signature files  See B.1.
C. Data built from sources other than the input text  -- no
II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
  1. topic fields used  Title, Description, Narrative, and Concepts (only first two).
  2. total computer time to build query (cpu seconds)  0.26 seconds
  3. which of the following were used?  none
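The word-pair representation described at the head of this form is concrete enough to sketch. The code below forms pairs from adjacent words and from words separated by one or two intermediate words, drops pairs occurring only once in a document, and emits records shaped like the sample above; the tokenization details are simplifying assumptions.

```python
# Sketch of the word-pair document representation: pairs of words within
# three word positions of each other, dropped if they occur only once in
# the document.  Tokenization is simplified.
from collections import Counter

def pair_records(doc_id, text):
    words = text.upper().split()
    pairs = Counter()
    for gap in (1, 2, 3):                      # adjacent, skip-1, skip-2
        pairs.update(zip(words, words[gap:]))
    for (w1, w2), count in sorted(pairs.items()):
        if count > 1:                          # singleton pairs are dropped
            yield f"{w1} {w2} {count} {doc_id}"

text = ("multimedia encyclopedia released today the multimedia encyclopedia "
        "ships on cd")
for record in pair_records("WSJ880815-0014", text):
    print(record)                # -> MULTIMEDIA ENCYCLOPEDIA 2 WSJ880815-0014
```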
D. Automatically built queries (routing)
  1. topic fields used  Title, Description, Narrative, Concepts (first two).
  2. total computer time to build query (cpu seconds)  55 seconds
  3. which of the following were used in building the query?
     c. phrase extraction
        (2) from all training documents  Word pairs occurring in the relevant training documents for the query but not in the irrelevant documents were used.
III. Searching
A. Total computer time to search (cpu seconds)
  1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)  This was not optimized for the current experiments. Run time was approximately 20 minutes per search. Proper optimization will reduce this time.
  2. ranking time (total cpu seconds to sort document list)  0.22 seconds
B. Which methods best describe your machine searching methods?
  4. n-gram matching
C. What factors are included in your ranking?
  11. n-gram frequency
IV. What machine did you conduct the TREC experiment on?  IBM 3090/300J
How much RAM did it have?  16 MB for a virtual machine.
What was the clock rate of the CPU?  14.5 nanoseconds, or 69 MHz.
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  40 hours of new development, beyond using word-pairing tools that were developed earlier over a period of years.
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Yes, search time could be reduced, but a reliable estimate of how much cannot be made at this time.
  3. What features is your system missing that it would benefit by if it had them?  Phrase weighting; term weighting and auxiliary single-term search; stemming; removal of pair-order effects; shortest-path network search.
System Summary and Timing
Bellcore

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
  1. stopword list  yes (though some experiments were run without the stoplist)
     a. how many words in list?  n = 439; the standard SMART list, I think
  2. is a controlled vocabulary used?  no
  3. stemming  none (except truncation at 20 characters/word)
  4. term weighting  yes, log(tf) * (1 - entropy)
  5. phrase discovery  no
  6. syntactic parsing  no
  7. word sense disambiguation  no
  8. heuristic associations  no
  9. spelling checking (with manual correction)  no
  10. spelling correction  no (not directly, but the LSI analysis does some of this for free)
  11. proper noun identification algorithm  no
  12. tokenizer (recognizes dates, phone numbers, common patterns)
  13. are the manually-indexed terms used?  no
  14. other techniques used to build data structures (brief description)  LSI/SVD analysis of the term-by-document matrix. Takes the raw term-by-doc matrix; transforms the entries using log-entropy term weightings; calculates the best "reduced-dimensional" approximation to the transformed matrix using the SVD. Number of dimensions 250-350. All query-doc matching is done in this reduced-dimension vector space.
B. Statistics on data structures built from TREC text (please fill out each applicable section)
  5. other data structures built from TREC text (what?)  LSI/SVD uses reduced-dimensional vectors (see below for a description of how they are derived). The number of dims was between 235 and 250. There is one such vector for each term and for each document. Queries are also represented as vectors and compared to every document.
     a. total amount of storage (megabytes)  All reduced-dimensional vectors are stored in a binary database. The database consists of a vector for every doc and every term occurring in more than one doc. The vectors currently consist of single-precision real values. For TREC, we built one database for each collection. Approx. 50,000 docs are sampled; terms that occur in more than one of these documents are used in the SVD analysis; the remaining docs are added to the database.

        DOE1  - docs: 226087  terms: 42221  ndim: 250  -> 262 meg db
        WSJ1  - docs: 99111   terms:        ndim: 250  -> 169 meg db
        AP1   - docs: 84930   terms: 78167  ndim: 250  -> 163 meg db
        ZIFF1 - docs: 75180   terms: 60565  ndim: 250  -> 135 meg db
        FR1   - docs: 26207   terms: 54713  ndim: 250  ->  80 meg db
        WSJ2  - docs: 74520   terms:        ndim: 235  -> 141 meg db
        AP2   - docs: 79923   terms: 82997  ndim: 235  -> 153 meg db
        ZIFF2 - docs: 56920   terms: 72197  ndim: 235  -> 121 meg db
        FR2   - docs:         terms: 48728  ndim: 235  ->  64 meg db

        We used 250 dims for routing and 235 dims for ad hoc queries. In general, database size will be (ndocs + nterms) * ndim * 4. The totals here are 1288 meg (750,000 docs and 585,000 terms). If a single database had been used, the total would have been smaller because of term overlap; currently, many of the terms are represented in more than one database, and there are only 200,000 unique terms.
     b. total computer time to build (approximate number of hours)  Four main stages:
        1. indexing (extracting keys, calculating weights, etc.)
        2. SVD (the number of dimensions extracted ranged from 235 to 310)
           NOTE 1: only 235-250 dims were actually used for retrieval. I don't have timing data for extracting only this smaller number of dimensions, but I'd estimate that the numbers for AP1, ZIFF1 and FR1 could be reduced by about 20%.
           NOTE 2: initial indexing and SVD are typically done on a subset of 50,000 docs and nterms.
        3. various i/o translations (much of this will go away soon)
        4. adding new docs to the database (if sub-sampled for SVD). The SVD is done on 50,000 docs; the remaining docs are indexed and added to the database after the SVD.

        All times in MINUTES (SVD run on a DEC5000; the rest on a SPARC 2):

        DOE1  - index: 49   SVD: 1219  io: 194  add: 591  SUM: 2053 mins
        WSJ1  - index: 241  SVD: 1474  io: 174  add: 404  SUM: 2293 mins
        AP1   - index: 271  SVD: 1644  io: 214  add: 455  SUM: 2584 mins
        ZIFF1 - index: 241  SVD: 1359  io: 156  add: 352  SUM: 2108 mins
        FR1   - index: 241  SVD: 939   io: 133  add: 0    SUM: 1313 mins
        WSJ2  - index: 427  SVD: 1382  io: 220  add: 461  SUM: 2490 mins
        AP2   - index: 338  SVD: 1210  io: 218  add: 273  SUM: 2039 mins
        ZIFF2 - index: 260  SVD: 1452  io: 208  add: 0    SUM: 1920 mins
        FR2   - index: 187  SVD: 486   io: 105  add: 0    SUM: 778 mins

     c. is the process completely automatic?  YES
     d. brief description of methods used  LSI/SVD analysis of the document collection:
        1. creates the raw term-by-doc matrix and transforms the entries using log-entropy term weightings;
        2. calculates the best "reduced-dimensional" approximation to the transformed matrix using the SVD. The number of dimensions in the SVD calculations ranged from 235 to 310, but only 235 or 250 were used for the comparisons. Fewer dims could have been calculated, so some reported SVD times are higher than necessary; I'd estimate about 20% reductions in SVD times for AP1, ZIFF1, and FR1;
        3. performs various database translations. The current SVD program outputs vectors in a different format and order than we need for the database; it will eventually output vectors in the appropriate database format, and this entire step can be omitted;
        4. the SVD calculations usually run on ~50,000 docs x nterms matrices. The remaining docs (if any) were indexed and added to the database here.
C. Data built from sources other than the input text  -- no
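A compact sketch of the LSI/SVD pipeline described in I.A.14 and I.B: log-entropy weighting of a term-by-document matrix, a truncated SVD, and cosine matching of a query (formed as the sum of its term vectors) against document vectors in the reduced space. NumPy is a stand-in for Bellcore's own tools, the matrix is a toy, and far fewer dimensions are kept than the 235-250 used for TREC.

```python
import numpy as np

def log_entropy_weight(tdm):
    """Transform a raw term-by-document count matrix with log-entropy weights."""
    tdm = np.asarray(tdm, dtype=float)
    p = tdm / np.maximum(tdm.sum(axis=1, keepdims=True), 1e-12)
    ndocs = tdm.shape[1]
    with np.errstate(divide="ignore", invalid="ignore"):
        entropy = -np.nansum(np.where(p > 0, p * np.log(p), 0.0),
                             axis=1) / np.log(ndocs)
    return np.log(tdm + 1.0) * (1.0 - entropy)[:, None]

def lsi(tdm, ndim):
    """Truncated SVD: returns term vectors, singular values, doc vectors."""
    U, s, Vt = np.linalg.svd(log_entropy_weight(tdm), full_matrices=False)
    return U[:, :ndim], s[:ndim], Vt[:ndim, :].T        # docs as rows

def match(query_term_ids, term_vecs, doc_vecs):
    """Query = sum of its term vectors; rank docs by cosine in LSI space."""
    q = term_vecs[query_term_ids].sum(axis=0)
    cos = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1)
                          * np.linalg.norm(q) + 1e-12)
    return np.argsort(-cos)

# Toy 6-term x 5-doc matrix and a 2-dimensional space (illustration only).
tdm = np.array([[2, 0, 1, 0, 0],
                [1, 1, 0, 0, 0],
                [0, 2, 3, 0, 0],
                [0, 0, 0, 1, 2],
                [0, 0, 0, 2, 1],
                [1, 0, 0, 0, 1]])
T, s, D = lsi(tdm, ndim=2)
print(match([0, 2], T, D))      # docs ranked for a query on terms 0 and 2
```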
C. Data built from sources other than the input text -- no

II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc) yes -- submitted two sets of ad hoc queries; queries were the same in both cases; the only difference was how information from different sub-collections was combined
1. topic fields used all (except NO manually indexed terms used)
2. total computer time to build query (cpu seconds) Queries are vector sums of constituent term vectors. A separate query vector is created for matching against each of 9 databases (DOE, WSJ1, AP1, FR1, ZIFF1, WSJ2, AP2, FR2, ZIFF2). Time = .4 sec/query/database -> 3.6 secs/query. NOTE: These times simulate handling each query separately (so there is no i/o buffering). There are big improvements if you initially read in all the term vectors and create all the ad hoc queries at once.
3. which of the following were used?
a. term weighting with weights based on terms in topics -- term weighting, but weights based on term usage in document collections
h. expansion of queries using previously-constructed data structure (from part I) -- not really

D. Automatically built queries (routing) yes -- submitted two sets of routing queries. Both were automatically created from 1) the text of the topics and 2) the relevant documents
1. topic fields used all (except NO manually indexed terms) for both 1) and 2)
2. total computer time to build query (cpu seconds) Queries are vector sums of constituent term vectors [case 1)] or document vectors [case 2)]. A separate query vector is created for matching against each of 4 separate databases (WSJ1, AP1, FR1, ZIFF1). Time = .4 sec/query/database in case 1) -> 1.6 secs/query. Time = .1 sec/query/database in case 2) -> .4 secs/query. NOTE: These times simulate handling each query separately (so there is no i/o buffering).
3. which of the following were used in building the query?
a. terms selected from (1) topic -- case 1); (3) only documents with relevance judgments -- case 2)
b. term weighting (2) with weights based on terms in all training documents

III. Searching
A. Total computer time to search (cpu seconds)
1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained) Time = ~50,000 query-doc comparisons/minute when all vectors are pre-loaded. Currently, we compare ALL docs to each query. For ad hoc queries, the time to compare a query to the 750,000 docs is ~12 minutes. For routing queries, the time to compare a query (new doc) to the profiles (50 profiles in each of 4 databases) is about .3 sec.
2. ranking time (total cpu seconds to sort document list) none; it's included in the times given in 1. Currently both comparisons and ranking are done in the same routine.
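The query construction and matching just described (a query is the weighted sum of its terms' reduced-dimensional vectors, then compared against every document) can be sketched as follows. The function and variable names are assumptions, and the toy vectors stand in for the term and document vectors built in section I; this is a reconstruction of the general idea, not the group's code.

# Minimal sketch of LSI retrieval as described above: a query is the (weighted)
# sum of the reduced-dimensional vectors of its terms, and every document is
# ranked by the cosine between its vector and the query vector.
import numpy as np

def query_vector(term_ids, term_vectors, weights=None):
    vecs = term_vectors[term_ids]
    if weights is not None:
        vecs = vecs * np.asarray(weights)[:, None]
    return vecs.sum(axis=0)

def rank_by_cosine(q, doc_vectors):
    q_norm = q / (np.linalg.norm(q) + 1e-12)
    d_norm = doc_vectors / (np.linalg.norm(doc_vectors, axis=1, keepdims=True) + 1e-12)
    sims = d_norm @ q_norm                      # cosine with every document
    return np.argsort(-sims), np.sort(sims)[::-1]

# Toy 3-dimensional space with 4 documents and 6 terms (illustrative only).
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(4, 3))
term_vectors = rng.normal(size=(6, 3))

q = query_vector([0, 2, 5], term_vectors)       # query built from terms 0, 2 and 5
order, scores = rank_by_cosine(q, doc_vectors)
print(order, scores)

For routing, the same machinery applies with the roles reversed: each incoming document vector is compared against the stored query profiles for each database.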
B. Which methods best describe your machine searching methods?
1. vector space model
C. What factors are included in your ranking? Hmm, not sure I get this. Similarity between a query and a document is the cosine between the query vector and the document vector. This cosine determines the rank. Term weights are used to determine the location of the query vector. The query is located at the weighted vector sum of its constituent terms.
1. term frequency -- log(tf)*(1-entropy) term weight, so there's a tf part
3. other term weights (where do they come from?) -- log entropy; weights come from training docs (disk 1) for routing queries, and from both the training and test docs for ad hoc queries
4. semantic closeness (as in semantic net distance) -- sort of, if you think of term vector locations as reflecting semantic associations. But these locations are automatically derived from the SVD analysis.
8. information theoretic weights -- log(tf) * (1-entropy)

IV. What machine did you conduct the TREC experiment on? How much RAM did it have? What was the clock rate of the CPU?
SVDs run on DEC5000 with ~400 meg; clock is ??? MHz.
All else run on SPARC 2 with 384 meg; clock is 25 MHz (I think).

V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system? Real hard. The system was built as a research prototype to look at many different issues. I'd say about 1-2 person-years, but this is much more than would have been required if specs had been fixed at the beginning.
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? The existing tools were used pretty much as is for TREC, even though they were developed to work with much smaller databases. Also, there are far more parameters and options than we typically use. Almost no effort went into re-engineering for large databases or to more efficiently handle what we now use as default parameters.
Time in query construction and retrieval is spent:
1) seeking for vectors in a single large database of term and doc vectors. The database could easily be split.
2) many calculations (scalings of various sorts) are done on the fly. This could be eliminated if one knew that users wanted to retrieve only documents, for example. Currently both terms and docs can be retrieved with the same programs and scaling isn't done until we see what the user wants retrieved.
3) all calculations are done in floating point. Could be done with integers.
4) each ad hoc query was compared to EVERY document. This can be speeded up by some document clustering algorithms that we have looked at. This can also be speeded up tremendously by using more than one machine or by using a parallel machine. All vectors are independent, so it's trivial to split query processing.
I'd guess that improvements of a factor of 2-5 could be obtained just by tweaking items 1), 2) and 3). Parallel query matching is the way to go. For example, we got speed-ups of 50-100 times using a MasPar for query storage and processing with no attempt to optimize.
In terms of pre-processing and SVD analyses:
1) about 10% of the time is spent in unnecessary i/o translation (because we've patched together pre-existing tools). Much of this will eventually go away.
2) more than 50% of the time is spent in the SVD.
These algorithms get better and faster all the time (the algorithm we now use is about 100 times faster than what we used initially). There are speed-memory tradeoffs in different SVD algorithms, so time can probably be decreased by a factor of 2 or 3 by using more memory. Parallel algorithms will help some, but probably only by a factor of 2 or 3. These are one-time costs for relatively stable domains. We've found that new items can be added to the existing solutions without redoing the scaling for a while. Others ???
3. What features is your system missing that it would benefit by if it had them? Precision would probably be increased by many of the standard things--phrases, proper noun identification, a tokenizer (for dates, phone numbers, addresses, etc.), and some better handling of negation and union. Some form of literal string matching might be useful in combination with LSI for some types of queries. Others ???

System Summary and Timing Queens College, CUNY

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list yes
a. how many words in list? 595
2. is a controlled vocabulary used? no
3. stemming yes
a. standard stemming algorithms -- which ones? Porter's algorithm
b. morphological analysis no
4. term weighting yes
5. phrase discovery no
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description) A table of 396 manually created 2-word phrases. When these are identified in adjacent positions in documents or queries, they are used as additional index terms. (A minimal sketch of this adjacency matching appears below, following item I.B.4.)

B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 378
b. total computer time to build (approximate number of hours) 95+11+2=108 for 500 MB, clock time
c. is the process completely automatic? Yes, if sufficient disk. Not in this experiment.
if not, approximately how many hours of manual labor? 0.5
d. are term positions within documents stored? No, but sentence positions yes. Can modify to capture word positions.
e. single terms only? Yes, except for I.A.14.
4. special routing structures (what?) See I.B.5, network node and edge files. Routing using network node and edge files is straightforward.
a. total amount of storage (megabytes) Node file: 4x7.5. Edge file: 4x4. Network segmented into 4, because of insufficient RAM.
b. total computer time to build (approximate number of hours) 40+5+1+4x0.2=46.8, starting from the text file.
c. is the process completely automatic? Yes, if sufficient RAM and disk space.
d. brief description of methods used
1. Process (old) collection A.
2. Process queries against collection A.
3. Process new collection B as if they were queries--to make use of collection A statistics.
4. Combine queries, (old) dictionary and collection B into a network for retrieval.
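The sketch below illustrates the adjacency matching described in item I.A.14: a small table of two-word phrases, where any adjacent occurrence in a document or query contributes an extra index term. The phrase entries and the whitespace tokenizer here are illustrative assumptions, not the group's actual table or code.

# Minimal sketch of word-pair phrase matching: a table of two-word phrases, and
# any adjacent occurrence in a document or query adds an extra index term.
PHRASE_TABLE = {("stock", "market"), ("interest", "rate"), ("joint", "venture")}

def extra_phrase_terms(text):
    tokens = text.lower().split()
    terms = []
    for first, second in zip(tokens, tokens[1:]):      # adjacent word pairs
        if (first, second) in PHRASE_TABLE:
            terms.append(first + "_" + second)          # added as an additional index term
    return terms

print(extra_phrase_terms("The stock market reacted to the new interest rate policy"))
# -> ['stock_market', 'interest_rate']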
5. other data structures built from TREC text (what?)
1. Subdocument file  2. Coded file  3. Docid checking file  4. Termid checking file  5. Docnum file  6. Termnum (dictionary) file  7. Direct file  8. Index to direct file  9. Node file  10. Edge file
a. total amount of storage (megabytes) 1. 481  2. 324  3. 7  4. 4  5. 11  6. 6  7. 372  8. 19  9. 4x14  10. 4x9
The system was developed for experimental research, with flexibility to generate other data. Some of the files are not necessary for retrieval.
b. total computer time to build (approximate number of hours) 1. 1.5  2,3,4,5,6. 95  7,8. 11  9,10. 4x0.25=1
c. is the process completely automatic? Yes, if sufficient RAM and disk space. For this experiment, no.
if not, approximately how many hours of manual labor? 2
d. brief description of methods used
raw text --> subdocument file
subdocument --> coded file, docid file, termid file, docnum file, termnum (dictionary) file. A Zipf-law program truncates the dictionary via user-assigned limits.
coded, termnum --> direct file with index
direct --> inverted file
direct, inverted --> node, edge files.

C. Data built from sources other than the input text
1. internally-built auxiliary files
a. domain independent or domain specific (if two separate files, please fill out one set of questions for each file) phrase file
b. type of file (thesaurus, knowledge base, lexicon, etc.) word-pair file
c. total amount of storage (megabytes) 0.005
d. total number of concepts represented 396
f. total computer time to build (approximate number of hours) 0 (this is a file created via editor).
g. total manual time to build (approximate number of hours) 16
h. use of manual labor (4) other (describe) Search for WSJ terminology in the library and from topics.
2. externally-built auxiliary file no

II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
1. topic fields used <TITLE>, <DESC>, <NARR>, <CON>
2. total computer time to build query (cpu seconds) 5 (average for each query).
3. which of the following were used?
a. term weighting with weights based on terms in topics yes + others
h. expansion of queries using previously-constructed data structure (from part I) yes
(1) which structure? word-pair phrase file
B. Manually constructed queries (ad hoc)
1. topic fields used <TITLE>, <DESC>, <NARR>, <CON>
2. average time to build query (minutes) 300 minutes for 25 queries
3. type of query builder b. computer system expert
4. tools used to build query
a. word frequency list sometimes
b. knowledge base browser (knowledge base described in part I) no
c. other lexical tools (identify) no
5. which of the following were used?
a. term weighting
b. Boolean connectors (AND, OR, NOT)
d. addition of terms not included in topic (1) source of terms word-pair phrase file
C. Feedback (ad hoc)
1. initial query built by method 1 or method 2? method 1
2. type of person doing feedback b. system expert
3. average time to do complete feedback
a. cpu time (total cpu seconds for all iterations) 12 per query per iteration -- no expansion; " " -- with expansion
b. clock time from initial construction of query to completion of final query (minutes) 60 per query to do relevance judgment
4. average number of iterations 1
a. average number of documents examined per iteration 10
5. minimum number of iterations 1
6. maximum number of iterations 1
7. what determines the end of an iteration? deadline + lack of manpower
8. feedback methods used
a. automatic term reweighting from relevant documents
b. automatic query expansion from relevant documents (2) only top X terms added (what is X) The top 20 most 'activated' terms that have document frequency < 2000 were used. Because many were already in the query, about 12 on average were new and added per query.
c. other automatic methods (brief description) feedback is based on sub-documents
D. Automatically built queries (routing)
1. topic fields used <TITLE>, <DESC>, <NARR>, <CON>
2. total computer time to build query (cpu seconds) 5 (average for each query)
3. which of the following were used in building the query?
a. terms selected from (1) topic
b. term weighting (1) with weights based on terms in topics (2) with weights based on terms in all training documents (3) with weights based on terms from documents with relevance judgments
i. expansion of queries using previously-constructed data structure (from part I) (1) which structure? word-pair phrase file
E. Manually constructed queries (routing)
1. topic fields used <TITLE>, <DESC>, <NARR>, <CON>
2. average time to build query (minutes) 300 minutes for 25 queries
3. type of query builder b. system expert
4. data used for building query a. from training topic
5. tools used to build query a. word frequency list sometimes
6. which of the following were used?
a. term weighting
b. Boolean connectors (AND, OR, NOT)

III. Searching
A. Total computer time to search (cpu seconds)
1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained) 8-20 per query without soft-Boolean (combine 2 methods); " " with soft-Boolean (combine 3 methods).
2. ranking time (total cpu seconds to sort document list) 4-12 per query
B. Which methods best describe your machine searching methods?
2. probabilistic model
8. neural networks
C. What factors are included in your ranking?
1. term frequency
3. other term weights (where do they come from?) inverse collection term frequency; total word occurrences
9. document length
IV. What machine did you conduct the TREC experiment on? SPARC-2GS
How much RAM did it have? 48 MB
What was the clock rate of the CPU? 40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system? Not much; time was spent to truncate record sizes to save space and fit certain structures in memory, and to replace some linked lists with arrays.
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? Yes, 50-100%. Lots of code was translated from PASCAL to C and used as is.
3. What features is your system missing that it would benefit by if it had them? Dedicated SPARCstation. More RAM. More disk space.

System Summary and Timing New York University

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list yes
a. how many words in list? 380
2. is a controlled vocabulary used? no
3. stemming yes
a. standard stemming algorithms no
b. morphological analysis yes
4. term weighting yes
5. phrase discovery yes
a. what kind of phrase? NPs, VPs, others
b. using statistical methods partially
c. using syntactic methods yes
6. syntactic parsing yes
7. word sense disambiguation no
8. heuristic associations yes
a. short definition of these associations synonymy, specializations
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm partial
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description)

B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 290 MB (0.5 GByte text)
b. total computer time to build (approximate number of hours) 250
c. is the process completely automatic? yes
d. are term positions within documents stored? yes
e. single terms only? no
3. knowledge bases yes
a. total amount of storage (megabytes) 0.5
b. total number of concepts represented 3262
c. type of representation (frames, semantic nets, rules, etc.) associations
d. total computer time to build (approximate number of hours) 175
e. total manual time to build (approximate number of hours) 0
f. use of manual labor none
g. auxiliary files needed for machine use (1) machine-readable dictionary (which one?) OALD

C. Data built from sources other than the input text -- no

II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc) yes
1. topic fields used <num>, <title>, <desc> and <narr>
2. total computer time to build query (cpu seconds) 3.0
3. which of the following were used?
b. phrase extraction from topics
c. syntactic parsing of topics
e. proper noun identification algorithm partial
g. heuristic associations to add terms
h. expansion of queries using previously-constructed data structure (from part I) (1) which structure? term clusters
j. other (describe) syntactic phrases

D. Automatically built queries (routing)
1. topic fields used same as in ad hoc
2. total computer time to build query (cpu seconds) 3.2
3. which of the following were used in building the query?
a. terms selected from (2) all training documents
b. term weighting (2) with weights based on terms in all training documents
c. phrase extraction (1) from topics (2) from all training documents
d. syntactic parsing (1) of topics (2) of all training documents
f. proper noun identification algorithm (1) from topics partial (2) from all training documents partial
h. heuristic associations to add terms (2) from all training documents
i. expansion of queries using previously-constructed data structure (from part I) (1) which structure? clusters from training data

III. Searching
A. Total computer time to search (cpu seconds)
1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained) TOTAL TIME (CPU + I/O) for search and ranking is about 60 minutes per query
2. ranking time (total cpu seconds to sort document list)
B. Which methods best describe your machine searching methods? 1. vector space model
C. What factors are included in your ranking? 1. term frequency 2. inverse document frequency
IV. What machine did you conduct the TREC experiment on?
How much RAM did it have? 56 Mbytes
What was the clock rate of the CPU? 28 MIPS
V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system? A lot
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? The base IR system is very inefficient now.
3. What features is your system missing that it would benefit by if it had them? There is still a lot of room for improvement of the NLP programs; more time and experiments are required.

System Summary and Timing University of Central Florida

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list yes
a. how many words in list? 166 stop words, 122 abbreviations, 47 hyphenated words, 24 entries for abbreviations and alternate notations for months, 35 entries for legitimate words not to be prefixed, and 6 entries for legitimate prefixes.
2. is a controlled vocabulary used? no
3. stemming yes
a. standard stemming algorithms -- which ones? J. B. Lovins' stemming algorithm (modified).
b. morphological analysis none
4. term weighting yes
5. phrase discovery no
6. syntactic parsing no
7. word sense disambiguation Yes. The semantic lexicon we used is based on word senses found in Roget's Thesaurus.
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) yes
a. which patterns are tokenized? The QA System recognizes dates. But we felt it was not useful for the NIST experiment so we removed this feature to improve text processing speed.
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description) The QA System uses B-tree storage structures for inverted index file access and semantic lexicon access. But for the NIST experiments, we used the QA System text scanning ability and coupled it with hash table access (replacing the B-tree access) and the use of 32-bit codes for text strings.

B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index yes
a. total amount of storage (megabytes) For Vol. 1 the index storage was 385 megabytes.
b. total computer time to build (approximate number of hours) 73 hours using nine IBM 50 MHz 486 PCs running in parallel.
c. is the process completely automatic? yes
d. are term positions within documents stored? no
e. single terms only? yes

C. Data built from sources other than the input text yes
1. internally-built auxiliary files yes, a semantic lexicon
a. domain independent or domain specific (if two separate files, please fill out one set of questions for each file) Domain independent
b. type of file (thesaurus, knowledge base, lexicon, etc.) Semantic lexicon built by examination of Roget's Thesaurus.
c. total amount of storage (megabytes) 0.34 megabytes.
d. total number of concepts represented There are 36 semantic categories and there are approximately 24,000 words in two lexicons with the categories they trigger. The probability of each triggered category is also stored.
e. type of representation (frames, semantic nets, rules, etc.) It could be viewed as rules.
f. total computer time to build (approximate number of hours) (1) if already built, how much time to modify for TREC? Since the 1911 edition of Roget's Thesaurus became public domain recently, we spent approximately 16 hours creating the software to process the 1911 Thesaurus. Approximately 6 hours of processing time was required to automatically extract 20,000 lexicon entries. However, we did not have time to explore the use of these entries.
g. total manual time to build (approximate number of hours) (1) if already built, how much time to modify for TREC? Prior to TREC, there were 3,000 entries in the lexicon established by manual processing of approximately 6,000 words in 300 hours. For TREC, we made 1,000 new entries (in 85 hours) by examination of 1,700 frequently occurring words found in the training topics and the training text. So, the lexicon we used had 4,000 entries in it.
h. use of manual labor (4) other (describe) Refer to (f) and (g).
2. externally-built auxiliary file no

II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc) yes
1. topic fields used All fields
2. total computer time to build query (cpu seconds) 1 second
3. which of the following were used?
f. tokenizer (recognizes dates, phone numbers, common patterns) Dates recognizable but not used.
h. expansion of queries using previously-constructed data structure (from part I) (1) which structure? Semantic lexicon described in I.C.1.
j. other (describe) Term weighting based on terms in training text.

D. Automatically built queries (routing) yes
1. topic fields used All fields
2. total computer time to build query (cpu seconds) 1 second.
3. which of the following were used in building the query?
a. terms selected from (1) topic
b. term weighting (2) with weights based on terms in all training documents
g. tokenizer (recognizes dates, phone numbers, common patterns) Dates are recognized by the QA System but were not used for the TREC experiments.
i. expansion of queries using previously-constructed data structure (from part I) (1) which structure? Semantic lexicon described in I.C.1.

III. Searching
A. Total computer time to search (cpu seconds) 3-10 minutes per query to retrieve and rank.
1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)
2. ranking time (total cpu seconds to sort document list)
B. Which methods best describe your machine searching methods? 1. vector space model
C. What factors are included in your ranking?
1. term frequency
2. inverse document frequency
9. document length
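The ranking factors just listed (term frequency, inverse document frequency, and document length) correspond to a standard length-normalized tf-idf score. The sketch below is a generic illustration of how those three factors combine, not the QA System's actual formula; all names, the toy documents, and the particular normalization are assumptions.

# Generic sketch of ranking with the factors listed above: term frequency,
# inverse document frequency, and document length, combined as a
# length-normalized tf-idf score.
import math
from collections import Counter

def score(query_terms, doc_tokens, doc_freq, n_docs):
    tf = Counter(doc_tokens)
    doc_len = len(doc_tokens) or 1
    s = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue
        idf = math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1))
        s += (tf[term] / doc_len) * idf          # tf normalized by document length, times idf
    return s

docs = [["oil", "spill", "cleanup"], ["oil", "price", "rise", "oil"], ["election", "results"]]
doc_freq = Counter(t for d in docs for t in set(d))
for d in docs:
    print(score(["oil", "spill"], d, doc_freq, len(docs)))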
IV. What machine did you conduct the TREC experiment on? We used nine IBM PS/2 Model 95 computers. These were 50 MHz 486 computers with 8 megabytes of RAM. Two of them had 16 megabytes of RAM. A 33 MHz 486 PC was used to distribute text to the nine IBM PCs for indexing and query processing.
How much RAM did it have?
What was the clock rate of the CPU?

V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system? Our QA System (built for NASA and restricted to an IBM-compatible PC platform running under DOS and using no other license-agreement commercial software such as a DOS extender) is a prototype and has been under development for one and a half years. Approximately 2,000 hours of programming have been used to develop the current software. The system is implemented in C and uses B-tree structures for the inverted file structure. We felt our system was not fast enough to appear reasonable for TREC, so we designed a separate system without a pleasant user interface which used a hashing scheme to establish codes for strings to cut down on storage space; we also eliminated the use of B-trees in this separate system. We custom built a system for TREC during July and August; approximately 400 hours of programming and debugging went into this effort. The custom system generated the results which we sent in. However, we are now trying to produce some semantic results using the original QA System.
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? Assuming we stay with DOS, then we could easily run 8 to 16 times faster using the following:
Hardware improvements: 1. New 66 MHz PCs now on the market. 2. Multiple hard drives. 3. 16 or 32 megabytes of RAM instead of 8 megabytes, to be used for a larger disk cache and for our hashing algorithms.
Software improvements: 1. Proper use of RAM for hashing. 2. Use of a DOS extender or a switch to an OS/2 or UNIX environment.
3. What features is your system missing that it would benefit by if it had them? The following software improvements would benefit retrieval performance: 1. Relevance feedback. 2. Larger semantic lexicon. 3. Breakdown of the lexicon into noun, verb, adjective, adverb, preposition, conjunction, and interjection uses, coupled with a part-of-speech tagger.

System Summary and Timing Advanced Decision Systems

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list yes
a. how many words in list? 421
2. is a controlled vocabulary used? no
3. stemming no
4. term weighting no
5. phrase discovery no
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description) yes--binary classification trees built automatically from the original documents and topic statements

B. Statistics on data structures built from TREC text (please fill out each applicable section)
5. other data structures built from TREC text (what?) yes--classification vectors; actually integer arrays
a. total amount of storage (megabytes) Only a few Kbytes for the training sets used for the official scores--vectors generated on the fly for routing the test data.
b. total computer time to build (approximate number of hours) Feature extraction takes less than 10 seconds per document.
c. is the process completely automatic? yes
d. brief description of methods used Given a specification of a set of features, for example a list of word tokens, the document is searched for the number of occurrences of each feature.

C. Data built from sources other than the input text -- no

II. Query construction (please fill out a section for each query construction method used)
D. Automatically built queries (routing)
1. topic fields used <desc>, <narr>, <con>, <def>
2. total computer time to build query (cpu seconds) Building the classification tree, including feature extraction, takes a matter of seconds--this does depend on the size of the training set though.
3. which of the following were used in building the query?
a. terms selected from (1) topic yes (2) all training documents no (3) only documents with relevance judgments yes--including some additional judgments generated by us
k. other (brief description) feature counts--in this case these are just word counts

III. Searching
A. Total computer time to search (cpu seconds)
1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained) approximately 20 hours (sic) of elapsed time on the WSJ test set--no accurate measures of CPU time available to us
2. ranking time (total cpu seconds to sort document list) approximately 5 minutes of elapsed time--no accurate measures of CPU time available to us
B. Which methods best describe your machine searching methods?
10. other (describe) binary classification algorithm based on counts of feature occurrence in the TEST document
C. What factors are included in your ranking?
15. other (specify) statistical estimate of the misclassification rate (probability) of the classifier

IV. What machine did you conduct the TREC experiment on? Sun SPARCstation IPC
How much RAM did it have? 24 MB
What was the clock rate of the CPU? 25 MHz

V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system? approximately 4 person-weeks for the TREC infrastructure--the CART algorithm implementation used was "off the shelf"
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? Absolutely! The feature extraction algorithms were not optimized for speed, and no database or indexes were built to do the testing. With faster algorithms and a set of inverted indexes, we estimate a document could be classified in less than 1 second.
3. What features is your system missing that it would benefit by if it had them? We would like to experiment with "off the shelf" tools to assist in feature specification and extraction, for example: a part-of-speech tagger, a tokenizer, a proper name recognizer. We also did not explore the use of concept-based techniques (e.g., RUBRIC/TOPIC) to provide low-level concepts as features.
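The routing approach described in this form (binary classification trees built with an off-the-shelf CART implementation over word-count features, with documents ranked by an estimated misclassification probability) can be sketched roughly as follows. scikit-learn's decision tree is used here only as a stand-in for the CART package actually used, and the training texts, labels, and parameter choices are toy assumptions.

# Rough sketch of routing via a binary classification tree over word-count
# features, with the estimated class probability used as the ranking score.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

train_docs = ["oil spill off the coast", "tanker leaks crude oil",
              "parliament debates budget", "election results announced"]
train_labels = [1, 1, 0, 0]                     # 1 = relevant, 0 = not relevant

vectorizer = CountVectorizer()                  # features are just word counts
X = vectorizer.fit_transform(train_docs)
tree = DecisionTreeClassifier(random_state=0).fit(X, train_labels)

test_docs = ["crude oil washed ashore", "budget vote delayed"]
X_test = vectorizer.transform(test_docs)
relevance = tree.predict_proba(X_test)[:, 1]    # ranking score: estimated P(relevant)
print(list(zip(test_docs, relevance)))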
System Summary and Timing CITRI, Royal Melbourne Institute of Technology

We are providing 2 reports on the system. This is because we have tried experiments on two very different systems, and tested quite different hypotheses.
Project: retrieval from a compressed database using the cosine measure and approximate representations of document lengths

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list no
2. is a controlled vocabulary used? no
3. stemming yes, for construction of index
a. standard stemming algorithms -- which ones? Lovins' 1968 algorithm
4. term weighting no
5. phrase discovery no
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description) no, but see discussion of compression below

B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 50.7 MB (37.9 MB for pointers, 12.8 MB for frequencies)
b. total computer time to build (approximate number of hours) 4.20 cpu hours, once a vocabulary has been built
c. is the process completely automatic? yes
d. are term positions within documents stored? no, but term frequency within document is stored
e. single terms only? yes
5. other data structures built from TREC text (what?) model for subsequent compression of text
a. total amount of storage (megabytes) 2.4 MB
b. total computer time to build (approximate number of hours) 2.54 hours
c. is the process completely automatic? yes
d. brief description of methods used count word and non-word frequencies using a splay tree
other data structures built from TREC text (what?) a single file of the text itself, compressed
a. total amount of storage (megabytes) 253.2 MB
b. total computer time to build (approximate number of hours) 3.10 cpu hours
c. is the process completely automatic? yes
d. brief description of methods used zero-order word-based model using Huffman coding
other data structures built from TREC text (what?) a file of document addresses and document lengths (for cosine)
a. total amount of storage (megabytes) 1.8 MB
b. total computer time to build (approximate number of hours) negligible
other data structures built from TREC text (what?) vocabulary for inverted index
a. total amount of storage (megabytes) 3.6 MB
b. total computer time to build (approximate number of hours) 2.41 cpu hours
c. is the process completely automatic? yes
d. brief description of methods used count stemmed word frequencies using a splay tree
other data structures built from TREC text (what?) a file of inverted index entry addresses
a. total amount of storage (megabytes) 1.2 MB
b. total computer time to build (approximate number of hours) negligible
other data structures built from TREC text (what?) a file of approximate document lengths
a. total amount of storage (megabytes) 0.2 MB
b. total computer time to build (approximate number of hours) negligible

C. Data built from sources other than the input text -- no

II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
1. topic fields used all
2. total computer time to build query (cpu seconds) less than one second
3. which of the following were used?
a. term weighting with weights based on terms in topics yes, as in cosine measure
j. other (describe) used stop words to eliminate common words from the query; eliminated SGML tags and all punctuation

III. Searching
A. Total computer time to search (cpu seconds) 1 & 2 were not timed separately; 35 seconds per query to identify the top 200 ranked items; a further 4.6 seconds of cpu to decompress the top 200 items, 18.6 seconds in total including retrieval time
1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)
2. ranking time (total cpu seconds to sort document list)
B. Which methods best describe your machine searching methods? 1. vector space model -- cosine measure
C. What factors are included in your ranking?
1. term frequency
2. inverse document frequency
9. document length -- approximate document lengths were used to reduce memory requirements
IV. What machine did you conduct the TREC experiment on? Sun SPARC 2
How much RAM did it have? 128 MB
What was the clock rate of the CPU? 25 MIPS
V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system? very little
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? Processing to rank items can be 30-50% faster; retrieval and decompression of text are currently limited by characteristics of the disk and the UNIX operating system.
3. What features is your system missing that it would benefit by if it had them? The current transformation of topics into queries is simple-minded; the database is static.

System Summary and Timing CITRI, Royal Melbourne Institute of Technology

We are providing 2 reports on the system. This is because we have tried experiments on two very different systems, and tested quite different hypotheses.
Project: retrieval from a compressed database using the cosine measure and approximate representations of document lengths

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list
a. how many words in list? 420
2. is a controlled vocabulary used? no
3. stemming
a. standard stemming algorithms -- which ones? Lovins' 1968 algorithm
b. morphological analysis no
4. term weighting tf.idf
5. phrase discovery
a. what kind of phrase? Adjacent pairs
b. using statistical methods yes
c. using syntactic methods no
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) queries only
10. spelling correction queries only
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? they were not discarded
14. other techniques used to build data structures (brief description)

B. Statistics on data structures built from TREC text (please fill out each applicable section)
2. n-grams, suffix arrays, signature files
a. total amount of storage (megabytes) Data (compressed) 220 MB; index 313 MB
b. total computer time to build (approximate number of hours) 23 hrs
c. brief description of methods used multi-organisational signature file
d. is the process completely automatic? yes

C. Data built from sources other than the input text -- no

II. Query construction (please fill out a section for each query construction method used)
A large number of techniques were tried.
A. Automatically built queries (ad hoc)
1. topic fields used Boolean queries were constructed from a variety of the topic fields. The queries were then ranked, possibly using different fields.
2. total computer time to build query (cpu seconds) approximately 10
3. which of the following were used?
a. term weighting with weights based on terms in topics
b. phrase extraction from topics
i. automatic addition of Boolean connectors or proximity operators

III. Searching
A. Total computer time to search (cpu seconds)
1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)
2. ranking time (total cpu seconds to sort document list)
These operations occurred together. It took 6 hrs to obtain a ranked list of 1,000 documents for each of the 50 queries.
B. Which methods best describe your machine searching methods? 1. vector space model
C. What factors are included in your ranking?
1. term frequency
2. inverse document frequency
7. proximity of terms
9. document length
15. other (specify) The location of the term in the query. A variety of models were tried.
IV. What machine did you conduct the TREC experiment on? Sun SPARC 2
How much RAM did it have? 128 MB
What was the clock rate of the CPU? 25 MIPS
V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system? The system is a robust research tool. Limited effort has been put into speed.
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? Considerably faster; we estimate it would be twice as fast if we changed the architecture (we use UNIX pipes to communicate). It is unclear what other speed-ups can occur.
3. What features is your system missing that it would benefit by if it had them? All sorts of things would be nice! A good form of transaction management would be the most useful to transform the system into a commercial product.
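The index described in I.B.2 of this report is a multi-organisational signature file; the particular organisation used is not given in the form, so the sketch below shows only the basic superimposed-coding idea behind signature files (hash each term of a document into a fixed-width bit signature, and test a query by checking that all of its signature bits are set). The signature width, number of hash functions, and all names are assumptions, not the group's implementation.

# Basic signature-file sketch: each document gets a fixed-width bit signature
# by superimposing hashed term codes; a query matches (possibly falsely) when
# all of its signature bits are present in the document signature.
import hashlib

WIDTH = 64          # bits per signature (illustrative choice)
HASHES = 3          # bits set per term (illustrative choice)

def term_bits(term):
    bits = 0
    for i in range(HASHES):
        h = int(hashlib.md5(f"{i}:{term}".encode()).hexdigest(), 16)
        bits |= 1 << (h % WIDTH)
    return bits

def signature(terms):
    sig = 0
    for t in terms:
        sig |= term_bits(t)        # superimposed coding
    return sig

docs = {"d1": ["compressed", "database", "cosine"],
        "d2": ["signature", "file", "boolean", "query"]}
doc_sigs = {doc_id: signature(terms) for doc_id, terms in docs.items()}

query_sig = signature(["boolean", "query"])
candidates = [d for d, s in doc_sigs.items() if query_sig & s == query_sig]
print(candidates)    # candidate documents; false matches must still be checked against the text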
System Summary and Timing Australian Computing and Communications Institute

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
The software does not invert the text. It inverts the queries (or filters) and passes the text through the combined index formed from the queries.

II. Query construction (please fill out a section for each query construction method used)
D. Automatically built queries (routing)
1. topic fields used Concept field used
2. total computer time to build query (cpu seconds) < 5 seconds
3. which of the following were used in building the query?
a. terms selected from (1) topic
b. term weighting (3) with weights based on terms from documents with relevance judgments -- Terms were weighted with weights based on terms from documents with relevance judgments, and dynamically modified through the training set and the test set.
c. phrase extraction (1) from topics
j. automatic addition of Boolean connectors or proximity operators (1) using information from the topics
E. Manually constructed queries (routing)
1. topic fields used All topic fields used
2. average time to build query (minutes) 30 minutes
3. type of query builder b. system expert
4. data used for building query a. from training topic
6. which of the following were used?
b. Boolean connectors (AND, OR, NOT)
c. proximity operators

III. Searching
A. Total computer time to search (cpu seconds) One message through 200 filters per second. This includes searching and ranking.
1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)
2. ranking time (total cpu seconds to sort document list)
B. Which methods best describe your machine searching methods?
6. fuzzy logic (include your definition)
10. other (describe) The software uses a fuzzy AND and a proximity measure to rank documents.
C. What factors are included in your ranking?
1. term frequency
5. position in document
7. proximity of terms
IV. What machine did you conduct the TREC experiment on? How much RAM did it have? What was the clock rate of the CPU? The experiments were run on an HP 486/33 with 8 Mbytes under SCO UNIX. The CD-ROM drive was accessed via NFS.
V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system?
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?
3. What features is your system missing that it would benefit by if it had them?
AMR is commercial-strength software developed by Computer Power Group. Its commercialisation software engineering phase took some three person-years.
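The form above says the software inverts the queries rather than the text, ranks with a fuzzy AND plus a proximity measure, and passes each incoming message through roughly 200 filters per second. The sketch below illustrates only the general filter-matching idea, with the fuzzy AND taken as the minimum of per-term scores (one common definition), since the form does not give the actual definitions; all names, the filters, and the scoring choices are assumptions, and proximity and the real AMR scoring are not reproduced.

# Illustrative sketch of routing a message through a set of query filters: the
# text is not inverted; each filter's terms are looked up in the incoming
# message and combined with a fuzzy AND (here, the minimum per-term score).
from collections import Counter

filters = {"oil_spills": ["oil", "spill", "tanker"],
           "elections":  ["election", "vote", "ballot"]}

def route(message, filters):
    tokens = message.lower().split()
    counts = Counter(tokens)
    length = len(tokens) or 1
    scores = {}
    for name, terms in filters.items():
        per_term = [counts[t] / length for t in terms]   # simple per-term evidence
        scores[name] = min(per_term)                     # fuzzy AND: weakest term dominates
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(route("tanker runs aground and oil spill spreads along the coast", filters))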
System Summary and Timing Carnegie Mellon University

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list No. But the NLP/morphological-analysis components of the system do limit the possible lexical categories of some English words to eliminate useless ambiguities. For example, "but" is given lexical category "cnj" (conjunction) and not alternative, possible categories, such as "sn" (singular.noun); "can" is limited to category "auxm" (modal.auxiliary.verb) and not "sn"; etc. Such selective restrictions have some of the effects of "stop-word" lists, since spurious (or irrelevant) categories will not enter into later indexing stages. Furthermore, the NLP/parsing components of the system return simplex noun phrases (NPs) as candidate terms in which some components of the NP are eliminated, such as quantifiers (e.g., "many", "one", etc.), determiners (e.g., "the", "a", etc.), and conjunctions (e.g., "and", "or", etc.). In addition, in normal CLARIT NP processing, the parser does not return prepositions, non-NP adverbs, and extra-NP elements. This practice, therefore, also has the effect of eliminating items that normally appear on "stop-word" lists. It clearly goes beyond that practice in eliminating all extra-NP words as well.
a. how many words in list? Approximately 100 lexical items have been given restrictive syntactic treatment, in addition to the words with unambiguously empty categories.
2. is a controlled vocabulary used? No
3. stemming No
a. standard stemming algorithms -- which ones?
b. morphological analysis Yes. The Morph component of the system provides for comprehensive inflectional-morphological analysis. In practice, the morph-normal form of nouns and adjectives is used in the NP-based terms of the system. Participles are not morphologically reduced (though it is possible to do so). Derivational-morphological analysis is not used. A lexicon of approximately 97,000 'root-form' items (English words) is the principal resource used by Morph in addition to its morphological rule set.
4. term weighting Yes/No. The CLARIT process uses NLP to identify candidate terms en route to indexing, development of associated resources (e.g., thesauri), and analysis of queries or topics. These are taken as the 'information units' of interest and are analyzed statistically and heuristically. 'Weights' are associated with terms at various stages of processing. In indexing TREC documents, for example, an IDF/TF score was associated with terms for each document. In the case of multi-word terms (the norm), the terms are assigned IDF/TF scores, and each word in the term is broken out and assigned an independent IDF/TF score.
5. phrase discovery Yes
a. what kind of phrase? Simplex noun phrases (= all modifiers and the head of the NP but no determiners, quantifiers, or post-head-position modifying phrases or clauses).
b. using statistical methods No. NPs retained for thesaurus creation are scored using statistically-based measures of expected 'rarity' (based on component words), distribution, frequency, and coverage. But NPs are not identified in texts based on statistical parsing, for example.
c. using syntactic methods Yes. NPs are discovered using a parser that implements a 'heuristic' grammar. In particular, following word-for-word morphological analysis (resulting in a set of syntactic-category tags for each word encountered in a text), the parser identifies the subsequences that form NPs. Identification of NPs is based on rules that perform NP-boundary-condition tests.
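Item 5.c above outlines the CLARIT approach: morphological analysis assigns a syntactic-category tag to each word, and rules over those tags pick out the simplex noun phrases. The sketch below is a toy pattern-based chunker over pre-tagged tokens, assuming a small illustrative tag set; it is not the CLARIT grammar or parser.

# Toy sketch of simplex noun-phrase chunking over tagged tokens, in the spirit
# of the description above: contiguous runs of adjectives and nouns ending in a
# noun head are kept, while determiners, quantifiers, conjunctions, verbs, etc.
# are dropped. The tag set and the tagging itself are illustrative assumptions.
tagged = [("the", "det"), ("large", "adj"), ("oil", "n"), ("spill", "n"),
          ("and", "cnj"), ("several", "qnt"), ("coastal", "adj"),
          ("cleanup", "n"), ("efforts", "n"), ("were", "v"), ("reported", "v")]

NP_TAGS = {"adj", "n"}            # words allowed inside a simplex NP

def simplex_nps(tagged_tokens):
    nps, current = [], []
    def flush():
        # keep a run only if it ends in a noun head
        if current and current[-1][1] == "n":
            nps.append(" ".join(w for w, _ in current))
        current.clear()
    for word, tag in tagged_tokens:
        if tag in NP_TAGS:
            current.append((word, tag))
        else:
            flush()
    flush()
    return nps

print(simplex_nps(tagged))   # -> ['large oil spill', 'coastal cleanup efforts']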
6. syntactic parsing Yes (see above). A single-pass parser follows morphological analysis.
7. word sense disambiguation No. No attempt is made to control for word senses in morphological or syntactic analysis. As noted above, disambiguation of grammatical categories is facilitated by restricting possible categories for selective items. In addition, absolute preferences are established for grammatical categories appearing in noun phrases.
8. heuristic associations
a. short definition of these associations Yes. The principal relation the system currently uses is that of 'similarity' of terms. 'Similarity' is determined by different procedures in different contexts. For example, partial or 'fuzzy' matching of terms is facilitated by noting whether terms share words or attested subphrases. For example, in vector-space modeling of documents, the contained words of all terms (in the document vector as well as the query vector) are broken out, giving, in effect, the possibility of matching parts of terms (though, technically, the individual words are realized as independent dimensions of the space). In addition, in nominating terms for inclusion in thesauri and in matching terms to thesauri, CLARIT processing takes account of contained words and attested subphrases.
9. spelling checking (with manual correction) No
10. spelling correction No
11. proper noun identification algorithm Yes/No. The system provides for identification of 'candidate proper nouns' based on morphological analysis. (Essentially, since the morphological analysis is virtually exhaustive for English, words that cannot be mapped to specific lexical items are given the provisional label "cpn"--'candidate proper noun'--and parsing proceeds accordingly.) There is a facility in CLARIT for highly reliable proper name (including acronym) identification, but it was not used in this round of TREC processing.
12. tokenizer (recognizes dates, phone numbers, common patterns)
a. which patterns are tokenized? Certain common abbreviations are included in the lexicon and, under morphological processing, are rendered into normalized forms. The system can utilize--and even partially discover--supplemental lexicons of domain-specific abbreviations and other phrasal-lexical patterns, but this facility was not used for TREC processing.
13. are the manually-indexed terms used? Yes/No. The manually-indexed terms were treated as additional text and processed (for NPs) along with the other sections of the topic statement. They may or may not have survived review; they were not given special treatment except as potential sources of NPs for the topic.
14. other techniques used to build data structures (brief description) The CLARIT system has facilities for the discovery of 'first-order' thesauri (= a list of important and characteristic terms) over collections of documents. The technique requires that documents in the collection be from the same 'domain' or 'topic' (broadly conceived) and is reliable only if the document set is large enough (e.g., minimally 2-3 megabytes).
TREC topics--even when supplemented by sets of relevant documents--fall far short of the minimal size required, so general CLARIT thesaurus discovery could not be used in preparing topics or to support the indexing of texts. However, one effect of the CLARIT thesaurus-discovery procedure is to rank terms in a collection based on their frequency, distribution, and 'rarity' scores. In preparing sets of terms to assist in partitioning the TREC corpus (to identify a subset of documents with the best candidates under any topic), we produced pseudo-thesauri for each topic by using CLARIT thesaurus-discovery modules. In particular, the process produced a list of terms from the available topic-relevant documents (or from a small sample of relevant documents that we may have found) and automatically chose the top (approximately 20%) ranked terms to supplement the original query (as derived from the topic statement) to produce a "routing/partitioning thesaurus" for the topic. (The use of this resource is described below.) Furthermore, in developing extended queries for our final processing step (= a vector-space retrieval), we supplemented the original set of terms for the topic with *all* of the terms from the small set of top-ranking documents (as determined by routing/partitioning score) for each topic.

B. Statistics on data structures built from TREC text (please fill out each applicable section)
4. special routing structures (what?) Yes. Each topic text was automatically analyzed by CLARIT to extract NPs. Terms nominated by parsing were reviewed by members of the CLARIT team for appropriateness (and retained or eliminated) and given a weight of "1", "2", or "3" to quantify relevance. Available topic-relevant documents were processed for supplemental terms (each given a fractional weight, e.g., "0.3"). The combined list--terms from the topic text and terms from the topic-relevant documents--formed a "routing/partitioning thesaurus" for the topic. Each TREC document was 'scored' against the routing/partitioning thesaurus for each topic. In particular, every NP in each document was matched against the NPs (terms) in the routing thesaurus; partial matches were allowed; a formula yielded a composite score for the document based on the number of exact and partial hits as a function of document length and term 'value'. In the first round (first 50 topics) of processing, this approach was used to identify the highest-scoring 2000 documents for each topic.
a. total amount of storage (megabytes) 0.6 megabytes for the merged 50 routing structures, i.e., the 50 "routing/partitioning thesauri" for the 50 topics.
b. total computer time to build (approximate number of hours) 5 minutes of real time--exclusive of the preparatory time to parse, build a simple index, find some relevant documents, review them, and combine them into an input file.
c. is the process completely automatic? The manual review and weighting of terms from the topic statement took approximately 5 minutes per topic. All additional steps were performed automatically.
d. brief description of methods used (See above.)
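The scoring step described in item 4 above (every NP in a document matched against the weighted NPs of the routing/partitioning thesaurus, with partial matches allowed and the total adjusted for document length) is only summarized in the form, so the sketch below is a loose reconstruction: exact matches earn the full term weight, sharing a word counts as a partial hit at reduced credit, and the sum is divided by document length. The weights, the partial-credit factor, and the normalization are assumptions, not the CLARIT formula.

# Loose sketch of scoring a document against a weighted routing/partitioning
# thesaurus of noun phrases: exact NP matches earn the full term weight, NPs
# that merely share a word earn partial credit, and the total is normalized by
# document length. The constants here are illustrative assumptions.
PARTIAL_CREDIT = 0.25

thesaurus = {"oil spill": 3.0, "coastal cleanup": 2.0, "tanker accident": 1.0}

def score_document(doc_nps, thesaurus):
    thesaurus_words = {t: set(t.split()) for t in thesaurus}
    total = 0.0
    for np_ in doc_nps:
        np_words = set(np_.split())
        for term, weight in thesaurus.items():
            if np_ == term:                                   # exact hit
                total += weight
            elif np_words & thesaurus_words[term]:            # partial hit (shared word)
                total += PARTIAL_CREDIT * weight
    return total / max(len(doc_nps), 1)                       # adjust for document length

doc_nps = ["oil spill", "emergency crews", "cleanup operation", "north coast"]
print(score_document(doc_nps, thesaurus))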
5. other data structures built from TREC text (what?) Each TREC document had to be formatted for CLARIT processing, by making the unique text ID accessible to CLARIT as a special field and by delimiting the beginning and end of each text in a file. Intermediate (but unretained) files generated in CLARIT processing include a file of the words in each text, in their original order, annotated with morphological categories. Other files contain the output of the parser, as a list of NPs in the order in which they occurred in each text. The parsed representation of the text was retained and used at all subsequent steps of processing.
a. total amount of storage (megabytes) Processing steps are piped through the system; intermediate files are not retained. The parsed representation of all the texts takes up approximately 98% of the space occupied by the original text.
b. total computer time to build (approximate number of hours) The total time to transform the original 2 gigabytes of text into parsed text is about 10 real hours, with processing distributed over 5 machines.
c. is the process completely automatic? Yes
d. brief description of methods used A 'lex' program was used to reformat the TREC text to CLARIT format. The English morphological analyzer is written in C and utilizes the lexicon of 97,000 items (mentioned above and further described below). The noun phrase parser, also written in C, uses the grammatical categories supplied by the morphological analysis and an ATN-style rule set to extract noun phrases.
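The following toy sketch shows the general shape of rule-based noun-phrase extraction from part-of-speech tags, in the spirit of the morphology-then-NP-parsing pipeline described under item d. The tag set and the single boundary rule are simplifying assumptions, not the CLARIT grammar.

# Toy illustration of rule-based NP extraction from POS tags. The tag set and
# the boundary rule are assumptions, not the CLARIT ATN rule set.

NP_TAGS = {"DET", "ADJ", "NOUN", "CPN"}   # CPN = candidate proper noun (see I.A.11)

def extract_noun_phrases(tagged_words):
    """tagged_words: list of (word, tag) pairs in text order.
    Returns maximal runs of NP-internal tags whose last word is nominal."""
    phrases, current = [], []
    for word, tag in tagged_words:
        if tag in NP_TAGS:
            current.append((word, tag))
        else:
            if current:
                phrases.append(current)
                current = []
    if current:
        phrases.append(current)
    return [" ".join(w for w, _ in run)
            for run in phrases if run[-1][1] in ("NOUN", "CPN")]

if __name__ == "__main__":
    tagged = [("the", "DET"), ("large", "ADJ"), ("oil", "NOUN"), ("spill", "NOUN"),
              ("damaged", "VERB"), ("Alaska", "CPN")]
    print(extract_noun_phrases(tagged))   # ['the large oil spill', 'Alaska']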
C. Data built from sources other than the input text
1. internally-built auxiliary files
a. domain independent or domain specific (if two separate files, please fill out one set of questions for each file) Domain independent
b. type of file (thesaurus, knowledge base, lexicon, etc.)
c. total amount of storage (megabytes) CLARIT Lexicon (2 megabytes); English word statistics derived from Grolier's Encyclopedia (2 megabytes)
d. total number of concepts represented 97,000 words (CLARIT Lexicon); 139,000 words (Grolier's list)
e. type of representation (frames, semantic nets, rules, etc.) Lexicon: a sorted word list, giving for each word its possible grammatical categories and category-dependent normalization. Grolier's: a list of words with distribution and frequency counts.
f. total computer time to build (approximate number of hours) (1) if already built, how much time to modify for TREC? Already built--not modified for TREC
g. total manual time to build (approximate number of hours) (1) if already built, how much time to modify for TREC? Already built--not modified for TREC
h. use of manual labor (1) mostly manually built using special interface (2) mostly machine built with manual correction (3) initial core manually built to "bootstrap" for completely machine-built completion (4) other (describe) Initially derived from on-line sources but substantially modified and maintained manually
2. externally-built auxiliary file
a. type of file (Treebank, WordNet, etc.) None
b. total amount of storage (megabytes)
c. total number of concepts represented
d. type of representation (frames, semantic nets, rules, etc.)
II. Query Construction (please fill out a section for each query construction method used)
B. Manually constructed queries (ad hoc) Note: as described below, there were only two steps in the CLARIT process that required non-automatic processing: (1) initial review and weighting of the index terms automatically nominated and derived for the topic, and (2) review of first-pass retrieved documents to identify 5-10 relevant ones for "feedback".
1. topic fields used <title>, <desc>, <narr>, <con>, <fac>, <def>
2. average time to build query (minutes) 5 minutes--average time to review and weight automatically nominated terms
3. type of query builder Graduate students
4. tools used to build query
c. other lexical tools (identify) CLARIT noun-phrase parsing (extraction) nominated query terms from the textual descriptions of topics.
5. which of the following were used?
a. term weighting Yes. Graduate students weighted terms with weights of "3", "2", or "1", according to whether the extracted term was central or peripheral to the topic. (Some extracted noun phrases were discarded as irrelevant or ill-formed; the vast majority were retained.)
c. proximity operators No, though proximity plays an implicit role when noun phrases are used as terms.
d. addition of terms not included in topic (1) source of terms Not in the first round of routing
e. other (describe) The ad hoc queries for the second fifty topics were formed in three stages. The first stage was the construction of a topic-derived routing/partitioning thesaurus. The routing/partitioning thesaurus was generated by CLARIT using the method described above, using only the text fields of the topics. The automatically derived noun phrases were hand-weighted by graduate students with weights of "3", "2", or "1", according to whether the extracted term was central or peripheral to the topic. Some extraneous terms were deleted. The routing/partitioning thesaurus was passed over the parsed representation of the original 1.2-gigabyte training set, inducing a ranking of all documents using a scoring method taking account of exact and partial matches and document length. The top 50 documents were retained for the next stage. These documents were manually judged by graduate students, starting from the highest scored and working downward until 5-10 relevant documents were found. In effect, this represented a "relevance-feedback" step in the retrieval process. In the next stage, the 5-10 "relevant" documents were used to produce a CLARIT-derived pseudo-thesaurus for the topic. (As described above, this consists of a list of prominent terms in the collection of documents, based on frequency, distribution, and "rarity" scores.) To this thesaurus were added the terms retained from the hand-weighting of the original topics. This thesaurus formed the second routing/partitioning thesaurus. The entire 2-gigabyte TREC collection was rescored against this second routing/partitioning thesaurus and the highest-ranking 2000 documents were selected for the final-query stage. The third, or final-query, stage involved, first, calculating an IDF/TF score for each term and all term-contained words in the 2000-document set for the topic. The query for that topic was created by taking the IDF/TF weightings of the terms from the originally chosen 5-10 relevant documents and automatically forming a query by combining all these terms along with the topic-derived terms into a long query vector. A vector-space representation of the 2000 documents was generated; the query vector was used to identify the final set of 200 ranked documents for each topic based on cosine similarity measures.
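A compact sketch of the final-query stage just described: TF*IDF vectors are built over the candidate document set, the expanded topic terms form one long query vector, and documents are ranked by cosine similarity. The exact weighting variant is an assumption; the form only says IDF/TF weights and cosine measures were used.

# Sketch of the final-query stage: TF*IDF weighting plus cosine ranking.
# The weighting variant is an assumption; only "IDF/TF" and cosine are stated.

import math
from collections import Counter

def tfidf_vector(term_counts, df, n_docs):
    vec = {t: c * math.log(n_docs / df[t]) for t, c in term_counts.items() if df[t]}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def cosine(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def rank(query_terms, docs):
    """docs: dict doc_id -> list of terms. Returns doc ids sorted by cosine."""
    df = Counter(t for terms in docs.values() for t in set(terms))
    n = len(docs)
    doc_vecs = {d: tfidf_vector(Counter(terms), df, n) for d, terms in docs.items()}
    q_vec = tfidf_vector(Counter(query_terms), df, n)
    return sorted(docs, key=lambda d: cosine(q_vec, doc_vecs[d]), reverse=True)

if __name__ == "__main__":
    docs = {"d1": ["oil", "spill", "tanker"], "d2": ["stock", "market", "oil"],
            "d3": ["tanker", "spill", "cleanup"]}
    print(rank(["oil", "spill", "cleanup"], docs))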
D. Automatically built queries (routing)
1. topic fields used <title>, <desc>, <narr>, <con>, <fac>, <def>
2. total computer time to build query (cpu seconds) 0.03 cpu seconds
3. which of the following were used in building the query?
a. terms selected from (1) topic (3) only documents with relevance judgments
b. term weighting (1) with weights based on terms in topics Yes. Topic terms were initially hand weighted.
c. phrase extraction (1) from topics (3) from documents with relevance judgments
d. syntactic parsing (1) of topics (2) of all training documents (3) of documents with relevance judgments
g. tokenizer (recognizes dates, phone numbers, common patterns) (1) which patterns are tokenized? Only simple acronyms such as "I.B.M." are recognized as a unit.
k. other (brief description) The routing queries were formed in two stages. The first stage was the construction of a routing/partitioning thesaurus. The routing/partitioning thesaurus was automatically generated by CLARIT from the supplied list of relevant documents per topic. The text of the topic fields was parsed and added to the pseudo-thesaurus derived from the relevant documents. (Each pseudo-thesaurus consists of automatically chosen noun phrases scoring above a certain threshold when scored for rarity, distribution, and frequency in the relevant document set.) Partial noun phrases, derived from thesaurus entries and attested in the documents, were also added to the thesaurus with a partial score. The routing/partitioning thesaurus was passed over the parsed representation of the 1.2-gigabyte training set, inducing a ranking of all 500,000 documents. The top 2000 documents were retained for the next stage. The next stage of construction of each topic's routing/partitioning query began by calculating the IDF/TF score of all the terms and their contained words in the 2000 retained documents for that topic. The IDF/TF-weighted terms from the 5 relevant documents that were ranked highest in the previous stage were added to the original hand-weighted query terms, forming the final query. For the second 900-megabyte data set, the routing/partitioning thesaurus developed in the first stage of processing (as described above) was used to select the 2000 highest-ranked documents. The final query produced in the second stage (above) was used as a vector-space query (with partial matching) over the 2000 documents to produce a final set of 200 ranked documents for each topic.
III. Searching
A. Total computer time to search (cpu seconds)
1. retrieval time (total CPU seconds between when a query enters the system until a list of document numbers are obtained) The final set of 2000 documents for each topic was collected by the use of the routing/partitioning thesaurus (described above). This process was done simultaneously for all queries and took about 6 hours for the complete corpus.
2. ranking time (total CPU seconds to sort document list) Once the vector-space matrix for the final set of 2000 documents was constructed, the actual comparison of the query vector to all other vectors in the matrix took on the order of 10-20 seconds.
B. Which methods best describe your machine searching methods?
1. vector space model Yes. Using whole and partial matching on IDF/TF-weighted terms.
C. What factors are included in your ranking?
1. term frequency
2. inverse document frequency
3. other term weights (where do they come from?) Topic terms were given additional factors of "1", "2", or "3".
7. proximity of terms Parts of noun phrases are close. Our partial matching of noun phrases implicitly includes proximity.
9. document length
IV. What machine did you conduct the TREC experiment on? How much RAM did it have? What was the clock rate of the CPU?
Total available machines, used variously:
1 DECstation 5820 (64-Meg RAM)
2 DECstation 5000 (32-Meg RAM)
1 DECstation 5000 (24-Meg RAM)
3 DECstation 3100 (24-Meg RAM)
V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. how much "software engineering" went into the development of your system? The CLARIT system is a research prototype and has been under development for 4 years. The original system was implemented in Lisp; the current system has been re-engineered into C in the past 12 months. The specific configuration of the system used in the TREC experiments was produced in less than a week. As a research prototype, the system has minimal true "software engineering".
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? Size constraints and the lack of global methods of attack caused us to duplicate work (both human and computer). Global methods that are smarter about resource consumption could make an order-of-magnitude difference. Almost all CLARIT processing is modular and separable; results of processes are additively composable. Splitting the process across machines--or running in parallel--would greatly speed up the system.
3. What features is your system missing that it would benefit by if it had them? User interface. Some database mechanism for document storage. Potential "next features" include the following:
- automatic spelling correction
- integrated proper noun recognition
- programmable token recognition
- programmable / automated category assignment (guessing)
- programmable / automated document structure analysis
- automated synonym / related word discovery and use
- database support for domains and thesauri, contexts, etc.
- an integrated interface for both database construction and query elaboration

System Summary and Timing
ConQuest Software, Inc.

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list yes
a. how many words in list? 70
2. is a controlled vocabulary used? no
3. stemming
a. standard stemming algorithms no
b. morphological analysis yes
4. term weighting yes
5. phrase discovery yes
a. what kind of phrase? Paraphrase of query
b. using statistical methods Statistical proximity match
c. using syntactic methods Limited
6. syntactic parsing Limited--PoS assignment
7. word sense disambiguation In query by user, and in explosion of terms
8. heuristic associations yes
a. short definition of these associations Terms associated via semantic net
9. spelling checking (with manual correction) In query only
10. spelling correction no
11. proper noun identification algorithm If identified by lexicon
12. tokenizer (recognizes dates, phone numbers, common patterns)
a. which patterns are tokenized? Many
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description) Index organized hierarchically so that the best documents (based on a coarse-grained ranking algorithm) are returned to the user while the search continues on very large databases. Linked lists are used to connect and identify idioms. Semantic-network term explosion is controlled by "weighted" links, where weights are selected as either numerical values or fuzzy sets based upon the link source and relationship.
B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 1.2 GB for 2.3 GB of text, 52%
b. total computer time to build (approximate number of hours) 150
c. is the process completely automatic? yes; if not, approximately how many hours of manual labor? Setup--4 hours
d. are term positions within documents stored? yes
e. single terms only? no
3. knowledge bases
a. total amount of storage (megabytes) 12 Mbytes
b. total number of concepts represented 250,000+ concepts, 1.5M links
c. type of representation (frames, semantic nets, rules, etc.) Weighted semantic network
d. total computer time to build (approximate number of hours) 0, already had it
e. total manual time to build (approximate number of hours) 0
f. use of manual labor (2) mostly machine built with manual correction yes--but prior to TREC, not DB specific
g. auxiliary files needed for machine use (1) machine-readable dictionary (which one?) Merriam-Webster (abridged) (2) other (identify) WordNet, plus several thesaurus files
C. Data built from sources other than the input text See 3(g) above
1. internally-built auxiliary files Semantic network
a. domain independent or domain specific (if two separate files, please fill out one set of questions for each file)
b. type of file (thesaurus, knowledge base, lexicon, etc.) All in one
c. total amount of storage (megabytes) 12
d. total number of concepts represented 250,000+
e. type of representation (frames, semantic nets, rules, etc.) Semantic net
f. total computer time to build (approximate number of hours) Already had (1) if already built, how much time to modify for TREC? None
g. total manual time to build (approximate number of hours) Already had (1) if already built, how much time to modify for TREC? None
h. use of manual labor (2) mostly machine built with manual correction
II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
1. topic fields used Used entire topic with some simple filtering
2. total computer time to build query (cpu seconds) unknown, est. < 0.1 sec. each
3. which of the following were used?
a. term weighting with weights based on terms in topics
b. phrase extraction from topics
c. syntactic parsing of topics
d. word sense disambiguation
e. proper noun identification algorithm (look up)
f. tokenizer (recognizes dates, phone numbers, common patterns) (1) which patterns are tokenized? many
h. expansion of queries using previously-constructed data structure (from part I) (1) which structure? Tapered window
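The sketch below illustrates the kind of weighted semantic-network term explosion described under I.A.14 and used for query expansion under II.A.3.h: links are followed transitively, weights multiply along the path, and low-weight expansions are pruned. The network, the weights, and the threshold are invented for the example; they are not ConQuest data or code.

# Illustrative sketch of term explosion over a weighted semantic network.
# The network, weights, and pruning threshold are assumptions for the example.

def explode(term, net, threshold=0.3, weight=1.0, seen=None):
    """net: dict term -> list of (related_term, link_weight).
    Follows links, multiplying weights along the path, and keeps every related
    term whose accumulated weight stays above the threshold."""
    if seen is None:
        seen = {}
    for related, link_w in net.get(term, []):
        w = weight * link_w
        if w >= threshold and w > seen.get(related, 0.0):
            seen[related] = w
            explode(related, net, threshold, w, seen)   # expand transitively
    return seen

if __name__ == "__main__":
    net = {
        "car":     [("automobile", 0.9), ("vehicle", 0.7)],
        "vehicle": [("truck", 0.6), ("transport", 0.4)],
    }
    # 'car' explodes to automobile (0.9), vehicle (0.7), truck (0.42);
    # 'transport' (0.28) falls below the threshold and is pruned.
    print(explode("car", net))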
B. Manually constructed queries (ad hoc)
1. topic fields used User judgment
2. average time to build query (minutes) 1-5 minutes
3. type of query builder b. computer system expert
4. tools used to build query
a. word frequency list yes
b. knowledge base browser (knowledge base described in part I) yes (1) which structure from part I
c. other lexical tools (identify) Lexicon
5. which of the following were used?
a. term weighting yes
b. Boolean connectors (AND, OR, NOT) Available. Not used.
c. proximity operators Automatic
d. addition of terms not included in topic yes--based on user judgment (1) source of terms
e. other (describe)
C. Feedback (ad hoc) Available. Not submitted in TREC.
D. Automatically built queries (routing) Available. Not submitted in TREC.
E. Manually constructed queries (routing) Available. Not submitted in TREC.
III. Searching
A. Total computer time to search (cpu seconds) 2-10 seconds, depending on query
1. retrieval time (total CPU seconds between when a query enters the system until a list of document numbers are obtained) see above
2. ranking time (total CPU seconds to sort document list) Included in number above
B. Which methods best describe your machine searching methods?
1. vector space model Some techniques used
2. probabilistic model Some probability used in ranking
5. Boolean matching Available. Not used in TREC.
6. fuzzy logic (include your definition) Fuzzy semantic-net links used in term explosion.
8. neural networks No--see 6
9. conceptual graph matching Yes--query concept created by explosion
C. What factors are included in your ranking?
1. term frequency
2. inverse document frequency Available. Not used in TREC.
3. other term weights (where do they come from?) Manual
4. semantic closeness (as in semantic net distance) yes
5. position in document Available. Not used in TREC.
6. syntactic clues (state how) Available. Not used in TREC.
7. proximity of terms
9. document length
10. completeness (what % of the query terms are present)
15. other (specify) User chooses--programmable
IV. What machine did you conduct the TREC experiment on? Sun SPARC II
How much RAM did it have? 64 Mbytes
What was the clock rate of the CPU? 50 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system? The underlying "engine" used for TREC is also used in a commercial product (ConQuest)--hence, lots of software engineering is behind it.
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? Yes--at least a factor of 2
3. What features is your system missing that it would benefit by if it had them? Subject-domain add-on dictionary.

System Summary and Timing
GE Research and Development Center

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching) We did no pre-indexing of the data
B. Statistics on data structures built from TREC text no data provided
C. Data built from sources other than the input text --no
II. Query construction (please fill out a section for each query construction method used)
B. Manually constructed queries (ad hoc)
1. topic fields used Mostly description, narrative, and concepts
2. average time to build query (minutes) About 20 minutes for initial query
3. type of query builder b. computer system expert
4. tools used to build query
b. knowledge base browser (knowledge base described in part I) (1) which structure from part I inverted samples of corpus
5. which of the following were used?
b. Boolean connectors (AND, OR, NOT)
c. proximity operators
d. addition of terms not included in topic (1) source of terms system lexicon, statistical analysis of samples matched by initial queries
C. Feedback (ad hoc) We did not do feedback, but we did query refinement
E. Manually constructed queries (routing) Ad hoc and routing were done using the same method
1. topic fields used
2. average time to build query (minutes) About 20 minutes for initial query
3. type of query builder b. system expert
4. data used for building query
b. from all training documents statistical analysis of samples retrieved
c. from documents with relevance judgments used for training, testing, and word frequency analysis
5. tools used to build query
d. machine analysis of training documents (1) describe Word-weighting analysis to determine what terms to add to queries
6. which of the following were used?
b. Boolean connectors (AND, OR, NOT)
c. proximity operators
e. other (brief description) system lexicon, statistical analysis of samples matched by initial queries
III. Searching
A. Total computer time to search (cpu seconds)
1. retrieval time (total CPU seconds between when a query enters the system until a list of document numbers are obtained) About 20 hours = 72,000 CPU seconds. As the documents are not pre-indexed, this includes all operations on all documents
2. ranking time (total CPU seconds to sort document list) About 300 CPU seconds
B. Which methods best describe your machine searching methods?
5. Boolean matching
7. free text scanning
C. What factors are included in your ranking?
5. position in document
15. other (specify) Number of hits on topic description
IV. What machine did you conduct the TREC experiment on? SUN SPARCstation-2
How much RAM did it have? 48 Meg
What was the clock rate of the CPU? standard
V. Some systems are research prototypes and others are commercial. To help compare these systems: Our system used a pattern matcher and lexicon that have been commercially developed, but the basic Boolean document processing engine was developed for TREC in a few days
1. How much "software engineering" went into the development of your system? 2 days for the basic engine
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? Processing time per document could easily be improved by a factor of 2. Processing time for ad hoc retrieval could be improved by a factor of about 100,000 by using an inverted indexing strategy, at a cost of additional storage and indexing time for the corpus.
3. What features is your system missing that it would benefit by if it had them? Automatic query generation, aids for compiling queries from higher-level descriptions
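As a minimal illustration of the Boolean-plus-hit-count approach reported above (Boolean matching and free-text scanning under III.B, "number of hits on topic description" under III.C), the sketch below filters with a Boolean AND and ranks the survivors by total hit count. The scoring detail is an assumption; the form does not describe the engine's exact combination of hit count and position.

# Minimal sketch of ranking by number of query-term hits during a free-text
# scan. The scoring rule is an assumption for illustration.

def scan_and_rank(query_terms, documents):
    """documents: dict doc_id -> full text. A document qualifies if every
    query term occurs at least once (Boolean AND); qualifying documents are
    ranked by total hit count."""
    results = []
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        hits = sum(tokens.count(t.lower()) for t in query_terms)
        if all(t.lower() in tokens for t in query_terms):
            results.append((doc_id, hits))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    docs = {"d1": "oil spill off the coast oil cleanup",
            "d2": "market report on oil prices",
            "d3": "spill cleanup after the oil tanker spill"}
    print(scan_and_rank(["oil", "spill"], docs))   # d1 and d3 qualify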
System Summary and Timing
TRW

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures? None--we used a free text scanning approach. CD-ROM data was decompressed and loaded onto magnetic disk in raw form.
B. Statistics on data structures built from TREC text (please fill out each applicable section) --none
C. Data built from sources other than the input text --none
II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc) We performed some initial trials with building queries based on word frequency taken from documents from the initial relevance judgments supplied by NIST. Unfortunately, this appeared to lead us down a blind alley, perhaps because the initial judgments were not all that good. We are planning to try this again with the new judgments.
B. Manually constructed queries (ad hoc) We primarily used this method for the TREC queries.
1. topic fields used
2. average time to build query (minutes) The initial query would take a couple of minutes to form manually by cutting and pasting from the topic descriptions with a text editor. "Feedback" was the human looking at the retrieved documents, comparing them with the sample good documents supplied by NIST, making independent judgments on document relevance, and refining the query in an iterative manner.
3. type of query builder b. computer system expert
4. tools used to build query no special tools
a. word frequency list
b. knowledge base browser (knowledge base described in part I) (1) which structure from part I
c. other lexical tools (identify)
5. which of the following were used?
b. Boolean connectors (AND, OR, NOT)
c. proximity operators
d. addition of terms not included in topic (1) source of terms Additional terms were supplied by a human, based on outside knowledge or from reading the text.
C. Feedback (ad hoc)
1. initial query built by method 1 or method 2? Initial queries were built by a human from a subset of topic keywords.
2. type of person doing feedback b. system expert computer systems analyst
3. average time to do complete feedback We did this manually.
a. cpu time (total cpu seconds for all iterations) A human refining the queries for an hour might use 10 minutes of FDF time.
b. clock time from initial construction of query to completion of final query (minutes) Feedback/query refinement was done manually. Some topics were fairly easy, with reasonable results being achieved in less than an hour. Others took several hours.
4. average number of iterations
a. average number of documents examined per iteration Typically 20-30.
5. minimum number of iterations Maybe 10.
6. maximum number of iterations Maybe 100.
7. what determines the end of an iteration? Each iteration is: (1) the human updates the queries, (2) the machine executes, (3) the human reviews the retrieved documents. We stopped working on a topic when it seemed that the results were converging to a practical limit for our approach, i.e., adding additional synonym keywords, or changing the query structure, wasn't going to produce more reasonable results.
8. feedback methods used
d. manual methods (1) using individual judgment with no set algorithm After working through the first dozen or so topics, we started to fall into a semi-routine. We are still thinking about the nature of this "routine" and what types of tools could help automate it.
E. Manually constructed queries (routing) Same answers as for ad hoc. In fact, given our query language approach, final ad hoc queries and routing queries are the same.
III. Searching
A. Total computer time to search (cpu seconds)
1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained) Time to process a single set of topic queries against 1.2 GB is 2-3 minutes. Time to load the TIPSTER corpus (read from CD-ROM, decompress, and load onto the FDF's disk) was less than 8 hours.
2. ranking time (total cpu seconds to sort document list) 1-2 seconds
B. Which methods best describe your machine searching methods?
7. free text scanning To perform the actual searches, we used the Fast Data Finder (FDF) text search hardware. The FDF implements a wide variety of pattern-matching functions including word/string/phrase matching, fuzzy matches, Boolean logic, proximity operators, term counting, term completeness, and numeric ranging.
C. What factors are included in your ranking?
5. position in document
7. proximity of terms
9. document length
10. completeness (what % of the query terms are present)
12. word specificity (i.e., animal vs. dog vs. poodle)
To provide a coarse-grain ranking we ran several queries per topic, to provide increasing levels of recall. The five methods above were used in addition to Boolean logic, numeric ranging, and word order.
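The sketch below illustrates the coarse-grain ranking idea just described under III.C: several queries per topic, ordered strictest to loosest, give increasing levels of recall, and a document's rank band is the strictest query it satisfies. The simple Python predicates here stand in for FDF pattern specifications, which are not shown in the form.

# Sketch of coarse-grain ranking via tiered queries of increasing recall.
# The predicates are stand-ins for FDF patterns; this is not TRW's code.

def coarse_rank(documents, tiered_queries):
    """documents: dict doc_id -> text. tiered_queries: list of predicates,
    ordered strictest first. Returns (tier, doc_id) pairs; lower tier = better."""
    ranked = []
    for doc_id, text in documents.items():
        for tier, matches in enumerate(tiered_queries):
            if matches(text):
                ranked.append((tier, doc_id))
                break
    return sorted(ranked)

if __name__ == "__main__":
    strict = lambda t: "oil spill" in t and "tanker" in t
    medium = lambda t: "oil spill" in t
    loose  = lambda t: "oil" in t or "spill" in t
    docs = {"d1": "tanker ran aground causing an oil spill",
            "d2": "oil spill reported near the harbor",
            "d3": "crude oil prices rose"}
    print(coarse_rank(docs, [strict, medium, loose]))
    # [(0, 'd1'), (1, 'd2'), (2, 'd3')]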
IV. What machine did you conduct the TREC experiment on? Sun-3/160 with FDF2000 and C-51 disk array
How much RAM did it have? 8 MB
What was the clock rate of the CPU? A Sun-3/160 is a couple of MIPS. Note that the Sun is just the host; the FDF does the actual pattern matching. The FDF2000 model used for TREC clocks at around 12 MHz.
V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system? No special programming was done for the TREC conference. The FDF system itself was the result of extensive prior development.
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? How fast would you like it to go? The system used to execute the TREC queries was 20% of a full-up system. We're currently working on software that will automatically coordinate multiple FDF systems to work in parallel. We're also considering faster FDF chips and data transfer methods.
3. What features is your system missing that it would benefit by if it had them? The next generation of FDF systems has an in-hardware term weighting capability that can be used, in combination with the existing features, to return a numeric score for a document. This would allow for finer grain in ranking. New model prototypes were not available for this effort.

System Summary and Timing
VPI & SU

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list yes
a. how many words in list? 41
2. is a controlled vocabulary used? no
3. stemming
4. term weighting Vector and p-norm runs were done with no term weights. Vector runs were also performed with augnorm * idf weighting.
5. phrase discovery no
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm As provided in SMART
12. tokenizer (recognizes dates, phone numbers, common patterns) As provided in SMART
13. are the manually-indexed terms used? not used, as suggested in guidelines
14. other techniques used to build data structures (brief description) 1983 version of SMART, enhanced with VPI&SU routines
B. Statistics on data structures built from TREC text (please fill out each applicable section) Except if you want us to answer under 4 here regarding the knowledge base used to help build our Boolean queries; please advise.
5. other data structures built from TREC text (what?) Document vector file and term dictionary
a. total amount of storage (megabytes) Approx. 15 MB for the dictionary and 121 MB for the document vector file for the entire Wall Street Journal collection.
b. total computer time to build (approximate number of hours) Approx. time to build the above: 10 hours on ccrd1 (DECstation 5000 Model 25, i.e., a MIPS R3000 chip running at 25 MHz)
c. is the process completely automatic? yes
d. brief description of methods used The document text is tokenized, stop words are thrown out, and non-noise words are kept in the term dictionary along with their occurrence frequency. Each term in the dictionary has a unique identification number. The vector file contains, for each document, its unique ID and a vector of term IDs and weights for the terms. The weighting scheme is flexible and can be changed to one of several schemes after the indexing is complete. (If necessary we can fill in details here. Please advise.)
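The sketch below illustrates the data layout just described under I.B.5.d: tokenize, drop stop words, assign each surviving term a unique ID in a dictionary, and store each document as a vector of (term ID, frequency) pairs. It is an illustration of the structure under a tiny assumed stop list, not SMART's actual code.

# Sketch of the term dictionary and document vector file layout described
# under I.B.5.d. The stop list is an assumption; this is not SMART code.

from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in"}   # assumed tiny stop list

def build_index(documents):
    """documents: dict doc_id -> text.
    Returns (dictionary, doc_vectors) where dictionary maps term -> [id, df]
    and doc_vectors maps doc_id -> list of (term_id, frequency)."""
    dictionary, doc_vectors = {}, {}
    for doc_id, text in documents.items():
        terms = [w for w in text.lower().split() if w not in STOP_WORDS]
        counts = Counter(terms)
        vector = []
        for term, freq in counts.items():
            if term not in dictionary:
                dictionary[term] = [len(dictionary), 0]   # [term_id, doc_freq]
            dictionary[term][1] += 1
            vector.append((dictionary[term][0], freq))
        doc_vectors[doc_id] = vector
    return dictionary, doc_vectors

if __name__ == "__main__":
    docs = {"WSJ0001": "the price of oil and the oil market",
            "WSJ0002": "oil spill in the gulf"}
    dictionary, vectors = build_index(docs)
    print(dictionary)   # term -> [term_id, document frequency]
    print(vectors)      # doc_id -> [(term_id, within-document frequency), ...]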
C. Data built from sources other than the input text --no
II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
1. topic fields used Description, Narrative, and Concepts.
2. total computer time to build query (cpu seconds) Vector queries--50 seconds for 50 topics
3. which of the following were used?
a. term weighting with weights based on terms in topics Term weighting was used for vector queries.
c. proper noun identification algorithm As provided in SMART
f. tokenizer (recognizes dates, phone numbers, common patterns) As provided in SMART
B. Manually constructed queries (ad hoc)
1. topic fields used Description, Narrative, and Concepts.
2. average time to build query (minutes) 3 mins/query
3. type of query builder b. computer system expert
4. tools used to build query
b. knowledge base browser (knowledge base described in part I) (1) which structure from part I For some of our work we built a knowledge base to help suggest broader/narrower terms--added information can be provided if appropriate
c. other lexical tools (identify) vi (editor)
5. which of the following were used?
b. Boolean connectors (AND, OR, NOT)
d. addition of terms not included in topic (1) source of terms domain knowledge of experts
III. Searching
A. Total computer time to search (cpu seconds) Approx. 4 minutes for each topic. We did a full sequential pass through the documents for this since we did not have enough disk space for the inverted file.
1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained)
2. ranking time (total cpu seconds to sort document list)
B. Which methods best describe your machine searching methods? We used three main methods, and a scheme for combining the results from those runs:
1. vector space model
5. Boolean matching
6. fuzzy logic (include your definition) p-norm matching
C. What factors are included in your ranking? We used several weighting methods in combination with the methods, to get a total of 8 runs that were the basis for our submission. We used binary weights, as well as:
1. term frequency
2. inverse document frequency
3. other term weights (where do they come from?) augnorm, computed by SMART using the above factors
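For the "fuzzy logic" matching mentioned under III.B.6, the sketch below shows the unweighted form of the p-norm AND/OR operators; the runs described above also used binary and tf/idf-style document weights, which are omitted here for brevity and are not taken from the form.

# Sketch of unweighted p-norm query evaluation (the p-norm "fuzzy" matching
# named under III.B.6). Document term weights are assumed to lie in [0, 1].

def p_or(weights, p):
    """weights: document weights (0..1) of the query terms under an OR clause."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def p_and(weights, p):
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

if __name__ == "__main__":
    # Query: (oil AND spill) with p = 2; document weights for the two terms.
    print(round(p_and([0.8, 0.6], 2), 3))
    # As p grows, p_and approaches the strict Boolean AND (the minimum weight),
    # and p_or approaches the strict Boolean OR (the maximum weight).
    print(round(p_and([0.8, 0.6], 10), 3))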
IV. What machine did you conduct the TREC experiment on? DECstation 5000 Model 25
How much RAM did it have? 40M bytes
What was the clock rate of the CPU? MIPS R3000 at 25 MHz. At the end of our work for the submission, we finally had 3 Gbytes of disk storage to work with.
V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system? We began with the 1983 version of SMART and have enhanced it. We tried to use the new version of SMART on an RS/6000 but could not get reliable results and so went back to our older version. We undertook extensive software development since May but due to lack of disk space could not use most of what we developed for the submission.
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? With more disk space we could have used the inverted file option and that would have made things much faster. That would have allowed real-time interactive searching. Also, with more disk space, we could have used an RS/6000, assuming SMART could be ported and made fully operational.
3. What features is your system missing that it would benefit by if it had them? Because of the disk space problem we were not able to do many of the things we wanted to do. Work will continue this fall if disks are received in time. Among the features:
- phrase identification and matching
- building "decision trees" after training with a sufficient set of relevance judgments
- implementing the CEO model of P. Thompson and trying it out in a variety of ways to combine results from a variety of runs and indexing schemes (that could include stemming and/or morphological analysis).

System Summary and Timing
GTE Laboratories

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list
a. how many words in list? 280 words
2. is a controlled vocabulary used? no
3. stemming
a. standard stemming algorithms which ones? Paice conflation
b. morphological analysis no
4. term weighting yes
5. phrase discovery no
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description)
B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 3360 (for the 2400 MB of text)
b. total computer time to build (approximate number of hours) 672
c. is the process completely automatic? yes
d. are term positions within documents stored? yes
e. single terms only? yes
5. other data structures built from TREC text (what?) statistics files
a. total amount of storage (megabytes) 400
b. total computer time to build (approximate number of hours) 24
c. is the process completely automatic? yes
d. brief description of methods used The index is scanned for frequency, location, popularity, and record-size statistics. The results are used in normalizing the weighting attributes.
C. Data built from sources other than the input text --no
II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
1. topic fields used Topic, Description, Narrative
2. total computer time to build query (cpu seconds) 2 seconds
3. which of the following were used?
a. term weighting with weights based on terms in topics
c. syntactic parsing of topics
j. automatic addition of Boolean connectors or proximity operators
D. Automatically built queries (routing)
1. topic fields used Topic, Description, Narrative
2. total computer time to build query (cpu seconds) 2 seconds
3. which of the following were used in building the query?
a. terms selected from (1) topic
b. term weighting (1) with weights based on terms in topics
d. syntactic parsing (1) of topics
j. automatic addition of Boolean connectors or proximity operators (1) using information from the topics
III. Searching
A. Total computer time to search (cpu seconds) 108,000 (all topics)
1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers are obtained) 72,000
2. ranking time (total cpu seconds to sort document list) ~36,000
B. Which methods best describe your machine searching methods?
10. other (describe) multi-level attribute weighting
C. What factors are included in your ranking?
1. term frequency
2. inverse document frequency
3. other term weights (where do they come from?) explicit term weighting by user
5. position in document
7. proximity of terms
9. document length
10. completeness (what % of the query terms are present)
15. other (specify) record (document) id
IV. What machine did you conduct the TREC experiment on? IBM RS/6000 320
How much RAM did it have? 32 MB
What was the clock rate of the CPU? 25 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system? This is a prototype.
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? Yes; given faster hardware and more RAM, we can probably double the performance.
3. What features is your system missing that it would benefit by if it had them? Variable-sized buckets to implement linked lists. Improved ranking attribute range calculation. Spelling correction.
System Summary and Timing
Siemens Corporate Research, Inc.

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.
Summary of method: Completely automatic vector matching where both document and query vectors have been expanded using synonyms extracted from WordNet.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list
a. how many words in list? 571-word stopword list used (standard SMART stopword list)
2. is a controlled vocabulary used? no
3. stemming
a. standard stemming algorithms which ones?
b. morphological analysis Extremely simple suffix stripper to look words up in WordNet. (Checks for one of 22 suffixes and possibly modifies the end of the stem if a matching suffix is found. This was in code I inherited--I don't know the source of the suffix list, but the list is a subset of that used by SMART, so it probably comes from some "standard" algorithm.) All words also pass through the "triestem" stemmer of SMART. This stemmer was originally based on Lovins's CACM article, but has evolved over the years.
4. term weighting A tf*idf weight is used for both query and document terms, where the weight is further normalized so that an inner product computation produces the cosine ("tfc" weights using the terminology of "Term Weighting Approaches in Automatic Text Retrieval" by Salton and Buckley). A term is counted as appearing in a document (for idf purposes) if it was in the original text or if it was added as a synonym. The tf*idf portion of an added term's weight is multiplied by .8 to produce its final weight.
5. phrase discovery
a. what kind of phrase?
b. using statistical methods
c. using syntactic methods
WordNet contains collocations as members of synonym sets, so some phrases may be added as synonyms. However, such a collocation is assigned a unique concept number and will only match that exact collocation (so I don't consider it to be "phrasing"). No other phrasing used.
6. syntactic parsing no
7. word sense disambiguation No specific sense disambiguation procedure used. If a term occurs in more than one WordNet synonym set (which, by definition, means that it is polysemous), the synonyms from all of its senses may potentially be added to the vector. The algorithm requires that at least two original text words agree on a synonym before it is added to the vector. The effect of this is to do a poor man's version of sense disambiguation for the synonyms.
8. heuristic associations
a. short definition of these associations WordNet synonymy is the only association relation used.
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description) no
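The sketch below illustrates the expansion rule described under items 4 and 7 above: a WordNet synonym is added to a vector only if at least two distinct original words nominate it, and added terms later receive 0.8 of the usual tf*idf weight (the weighting itself is not shown). The tiny synonym table stands in for WordNet; it is not real WordNet data.

# Illustration of the two-word agreement rule for adding WordNet synonyms.
# The synonym table is an invented stand-in, not WordNet data.

SYNONYMS = {            # word -> assumed sample of synonym-set members
    "car":  {"automobile", "machine"},
    "auto": {"automobile"},
    "gun":  {"machine"},             # "machine" reachable via another sense
}

def expand(original_words, min_agreement=2):
    votes = {}
    for word in set(original_words):
        for syn in SYNONYMS.get(word, set()):
            votes.setdefault(syn, set()).add(word)
    # Keep only synonyms nominated by at least `min_agreement` original words.
    return [syn for syn, voters in votes.items() if len(voters) >= min_agreement]

if __name__ == "__main__":
    # "automobile" is supported by both "car" and "auto", so it is added;
    # "machine" is supported by "car" and "gun" here, which shows why the
    # agreement rule only approximates sense disambiguation.
    print(expand(["car", "auto", "gun"]))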
B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 947 megabytes of disk storage
b. total computer time to build (approximate number of hours) 5 hours to build the index given document vectors; document vectors took 37 hours to build from text. Thus, approximately 42 hours to go from text to inverted index.
c. is the process completely automatic? yes
d. are term positions within documents stored? No term position information maintained.
e. single terms only? Single terms only (although, as stated above, a single term from WordNet may be a collocation such as 'electrical_discharge').
2. n-grams, suffix arrays, signature files N-grams and signature files not used. The SMART stemmer algorithm incorporates a (static) trie of suffixes.
3. knowledge bases No knowledge base used other than WordNet (described under I.C.2).
C. Data built from sources other than the input text
1. internally-built auxiliary files none
2. externally-built auxiliary file
a. type of file (Treebank, WordNet, etc.) WordNet (noun portion only)
b. total amount of storage (megabytes) 5 megabytes
c. total number of concepts represented 35,155 synonym sets (67,293 word senses)
d. type of representation (frames, semantic nets, rules, etc.) We used only the synonymy relation that WordNet contains. However, WordNet contains many other lexical relationships, making it similar to a semantic net.
II. Query construction (please fill out a section for each query construction method used) [We submitted one set of results; those results were for automatically built ad hoc queries.]
A. Automatically built queries (ad hoc)
1. topic fields used Concepts (<con>), Description (<desc>), Factors (<fac>), Narrative (<narr>), Nationality (<nat>), Title (<title>)
2. total computer time to build query (cpu seconds) 1 second, on average (50 seconds to index 50 queries)
3. which of the following were used?
a. term weighting with weights based on terms in topics Yes, as described above
d. word sense disambiguation Only as described above (two original query terms must agree on a synonym for it to be added).
h. expansion of queries using previously-constructed data structure (from part I) (1) which structure? WordNet.
III. Searching
A. Total computer time to search (cpu seconds)
1. retrieval time (total CPU seconds between when a query enters the system until a list of document numbers are obtained) 15 CPU seconds, on average (756.4 cpu seconds to process 50 queries)
2. ranking time (total CPU seconds to sort document list) not applicable: a list of the top 200 similarities is maintained while searching
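As an illustration of keeping only the top-N similarities while scoring documents, as mentioned under III.A.2 (so no separate sort pass is needed), the sketch below uses a fixed-size min-heap: anything scoring below the heap's current minimum is discarded immediately. N = 200 matches the run described above; the heap implementation itself is an assumption about one reasonable way to do it.

# Sketch of maintaining a top-N similarity list during the search pass,
# as described under III.A.2. The heap approach is an illustrative choice.

import heapq

def top_n_similarities(scored_docs, n=200):
    """scored_docs: iterable of (doc_id, similarity). Returns the n highest
    similarities as (similarity, doc_id) pairs, best first."""
    heap = []
    for doc_id, sim in scored_docs:
        if len(heap) < n:
            heapq.heappush(heap, (sim, doc_id))
        elif sim > heap[0][0]:
            heapq.heapreplace(heap, (sim, doc_id))   # drop current minimum
    return sorted(heap, reverse=True)

if __name__ == "__main__":
    scores = [("d1", 0.12), ("d2", 0.87), ("d3", 0.45), ("d4", 0.66)]
    print(top_n_similarities(scores, n=3))   # best three, highest first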
B. Which methods best describe your machine searching methods?
1. vector space model
C. What factors are included in your ranking?
1. term frequency
2. inverse document frequency
4. semantic closeness (as in semantic net distance) (synonyms)
9. document length
13. word sense frequency (nouns with only one sense in WordNet get all their synonyms added)
IV. What machine did you conduct the TREC experiment on? Sun IPX
How much RAM did it have? 64 megabytes
What was the clock rate of the CPU? 40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system? Our system is a version of SMART with modified indexing code. SMART has been well engineered (but its main goal is flexibility, not raw speed). Little time was spent optimizing our modifications.
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)? SMART could probably be made to run somewhat faster if it were made less flexible, that is, if we coded a version that performed only the sorts of runs we made here. I doubt the difference would be dramatic. Preprocessing steps performed on WordNet could improve the efficiency of the expansion code.
3. What features is your system missing that it would benefit by if it had them? Incorporating part-of-speech tagging so that we could know whether a term is a noun before looking it up in WordNet should be beneficial (we didn't do this for TREC because the tagger we have is fairly slow). In the same vein, a true sense disambiguator--a way of picking the correct WordNet synonym set--would clearly help, but I don't know of a way of doing that automatically yet (it is part of our research).