the index. Again, there are many variants on inverted file
   representations [10].
   Inverted files have proved the only technique which sup-
 ports interactive access to very large databases; this is prima-
 rily due to the excessive I/O requirements of the other
 methods. Within the family of inverted file methods, several
 variants are possible:
 * Simple Inverted Files. In a simple inverted file, the index
   consists of a list of documents in which each word occurs.
 * Inverted Files with Weights. In an inverted file with
   weights, the above information is supplemented with
   weights, in support of vector retrieval methods.
 * Structured Inverted Files. In a structured inverted file,
   the index captures structural information (typically the
   paragraph/sentence/word coordinates of each occurrence
   of each term), in support of boolean retrieval methods
   which rely on proximity operators (e.g. word adjacency).
   In this research, we have decided to explore the use of
 structured inverted files for the following reasons:
 * They support the proximity operations required as part of
   boolean retrieval systems. This support makes architec-
   tures based on structured inverted files more likely to be
   adopted by the major on-line vendors.
 * They provide a flindamentally richer representation of
   document structure than is available with the other meth-
   ods.
 * They are collection-independent and retrieval-method
   independent.
   This last point requires some explanation. In a distributed
 environment, it would be useful to be able to search text files
 at multiple locations as if they were a single text file. Once
 weights - which are both collection-dependent and retrieval-
 method dependent - are incorporated into the index, trans-
 parent distributed access becomes impossible. The product of
 this is that the results of a single query applied to multiple
 databases cannot be meanin~ly combined, since there is no
 way to compare the ranks or scores applied to the documents
 returued.
   One of the primary challenges associated with the use of
 structured inverted files is their bulk: there are approximately
 as many index entries in the inverted file as there are words in
 the database, and each entry must represent a document, para-
 graph, sentence, and word coordinates for that word. This
 problem has been substantially solved by use of a novel coin-
 bination of compression techniques [11], which allow struc-
 tured indexes having a bulk on the order of 1/3 the size of the
 full text to be constructed.
   The second challenge associated with the use of structured
 inverted files is to implement the standard information
 retrieval model using them. The results of this effort are
 reported in this paper.


                                         118

  The final challenge associated with structured inverted files
 is using them to implement methods which go beyond the
 standard model, taking into account the added richness the
 structured representation to improve retrieval system peffor-
 mance for databases having long, structured documents. This
 final challenge remains a topic for future research, and there
 are no results to report at this time.
 B. The Database Architecture
  The database consists of the following structures:
 * A Compressed Structured Posting File. The first compo-
   nent of the database is an array of compressed postings.
   Each posting gives the location of a word expressed as a
   document-paragraph-sentence-word 4-tuple. The postings
   are sorted by word D, in ascending document order.
   A Lexicon/Index. The second component of the database
   is a lexicon[mdex. For each word in the database, it stores
   the number of times itoccurs plus the location of its post-
   ings in the posting file. The lexicon is structured so that the
   information pertaining to a given word may be qulckly
   located.
 * Document Information. The final component of the data-
   base is an array of document-related information. Each
   entry contains a mapping from an internal document iden-
   tifier (an integer) to an external document identifier (a
   string). Additional information, such as the length of the
   document or normalization information, may be added as
   necessary.
  The following operations are supported:
 * Adding Postings. A 5-tuple consisting of a word plus its
   four coordinates may be added to the database.
 * Extracting Postings. A word is presented to the database.
   An array having the decompressed coordinates of all
   occurrences of that word is returned. The array is sorted by
   ascending document coordinate, with paragraph, sentence,
   and word coordinates used as additional sort keys as
   needed.
 * Extracting Lexicon Information. Lexicon information
   such as the number of occurrences of a word may be
   extracted.
 * Setting Lexicon Information. Supplemental lexicon
   information (e.g. part-of-speech information) may be
   added to the lexicon. This feature is not used in the work
   reported herein.
 * Delining a Document. Defines document-specific infor-
   mation such as external document identifier, document
   location on disk, etc., and associates it with a document
   coordinate.
 * Extracting Document Information. Extracts the docu-
   ment-specific information mentioned above, given a docu-
   ment coordinate.