the index. Again, there are many variants on inverted file representations [10]. Inverted files have proved the only technique which sup- ports interactive access to very large databases; this is prima- rily due to the excessive I/O requirements of the other methods. Within the family of inverted file methods, several variants are possible: * Simple Inverted Files. In a simple inverted file, the index consists of a list of documents in which each word occurs. * Inverted Files with Weights. In an inverted file with weights, the above information is supplemented with weights, in support of vector retrieval methods. * Structured Inverted Files. In a structured inverted file, the index captures structural information (typically the paragraph/sentence/word coordinates of each occurrence of each term), in support of boolean retrieval methods which rely on proximity operators (e.g. word adjacency). In this research, we have decided to explore the use of structured inverted files for the following reasons: * They support the proximity operations required as part of boolean retrieval systems. This support makes architec- tures based on structured inverted files more likely to be adopted by the major on-line vendors. * They provide a flindamentally richer representation of document structure than is available with the other meth- ods. * They are collection-independent and retrieval-method independent. This last point requires some explanation. In a distributed environment, it would be useful to be able to search text files at multiple locations as if they were a single text file. Once weights - which are both collection-dependent and retrieval- method dependent - are incorporated into the index, trans- parent distributed access becomes impossible. The product of this is that the results of a single query applied to multiple databases cannot be meanin~ly combined, since there is no way to compare the ranks or scores applied to the documents returued. One of the primary challenges associated with the use of structured inverted files is their bulk: there are approximately as many index entries in the inverted file as there are words in the database, and each entry must represent a document, para- graph, sentence, and word coordinates for that word. This problem has been substantially solved by use of a novel coin- bination of compression techniques [11], which allow struc- tured indexes having a bulk on the order of 1/3 the size of the full text to be constructed. The second challenge associated with the use of structured inverted files is to implement the standard information retrieval model using them. The results of this effort are reported in this paper. 118 The final challenge associated with structured inverted files is using them to implement methods which go beyond the standard model, taking into account the added richness the structured representation to improve retrieval system peffor- mance for databases having long, structured documents. This final challenge remains a topic for future research, and there are no results to report at this time. B. The Database Architecture The database consists of the following structures: * A Compressed Structured Posting File. The first compo- nent of the database is an array of compressed postings. Each posting gives the location of a word expressed as a document-paragraph-sentence-word 4-tuple. The postings are sorted by word D, in ascending document order. A Lexicon/Index. The second component of the database is a lexicon[mdex. For each word in the database, it stores the number of times itoccurs plus the location of its post- ings in the posting file. The lexicon is structured so that the information pertaining to a given word may be qulckly located. * Document Information. The final component of the data- base is an array of document-related information. Each entry contains a mapping from an internal document iden- tifier (an integer) to an external document identifier (a string). Additional information, such as the length of the document or normalization information, may be added as necessary. The following operations are supported: * Adding Postings. A 5-tuple consisting of a word plus its four coordinates may be added to the database. * Extracting Postings. A word is presented to the database. An array having the decompressed coordinates of all occurrences of that word is returned. The array is sorted by ascending document coordinate, with paragraph, sentence, and word coordinates used as additional sort keys as needed. * Extracting Lexicon Information. Lexicon information such as the number of occurrences of a word may be extracted. * Setting Lexicon Information. Supplemental lexicon information (e.g. part-of-speech information) may be added to the lexicon. This feature is not used in the work reported herein. * Delining a Document. Defines document-specific infor- mation such as external document identifier, document location on disk, etc., and associates it with a document coordinate. * Extracting Document Information. Extracts the docu- ment-specific information mentioned above, given a docu- ment coordinate.