the face of non-exhaustive relevance judgements. When precision rates are around 30%, and a further (in the L = 1,000 case) 30% of documents are unjudged, there can be no significance whatsoever attached to the difference between even 30% precision and 40% precision. Indeed, assuming that 27.1% of the unjudged documents are relevant for the "L = 1,000; Doc" combination gives a final precision of 0.411; the corresponding number for the "All; Doc" pairing is only 0.373. Thus, the precision figures of Table 1 are sufficiently imprecise that no conclusion can be drawn about the appropriate value of L, or about the merits of document vs. paged retrieval. There is clearly scope for research into other methodologies for comparing retrieval mechanisms.

3 Structured documents

Many of the documents in the TREC collection are very large and have explicit structure, and it may be possible to use this structure, rather than the statistically based pagination methods described above, to break documents into parts. In particular, many documents can be broken up into a set of sections, each section having a type.

There has been relatively little work done on retrieving or ranking partial documents. However, Salton et al. [12] have demonstrated that document structure can be valuable. Sometimes this structure is explicitly available [2], and sometimes it has to be discovered [5], but knowledge of this structure has been shown to help determine the relevance of sub-documents. In this part of the work we used a small database to investigate whether retrieval of sections helped document retrieval, and whether retrieval of documents helped section retrieval. By way of a benchmark, the paged retrieval techniques described earlier were applied to the same database.

3.1 The database

Since we needed information about the relevance of sections to queries, it was not possible to use the full TREC database. Instead, we used a database consisting of 4,000 documents extracted from the Federal Register collection. These documents were selected as the 2,000 largest documents that were relevant to at least one of topics 51-100, the topics provided for the first TREC experiment. Another 2,000 documents were randomly selected from the Federal Register collection to provide both smaller documents and non-relevant documents. The average number of words in these documents was 3,260.

These documents were then split into sections based on their internal markup. The documents had a number of tags inserted that defined an internal structure, and it appeared that only the T2 and T3 tags could be reliably used to indicate a new internal fragment. Section breaks were therefore defined to be a blank line, or a line containing only markup, followed by a T2 or T3 tag. This led to a database of 32,737 sections. Each of these sections had a type based on its tag. The types were (purpose), (abstract), (start), (summary), (title), (supplementary), and a general category, (misc), that covered all remaining sections.
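As a rough illustration of this splitting rule, the sketch below splits a document into sections wherever a T2 or T3 tag follows a blank or markup-only line. It assumes SGML-style <T2> and <T3> markers; the exact Federal Register markup and the splitter actually used in the experiments are not reproduced here.

```python
import re

# Assumed tag syntax; the actual Federal Register markup may differ.
SECTION_TAGS = ("<T2>", "<T3>")

def is_markup_only(line):
    """True if the line is blank or contains nothing but SGML-style tags."""
    return re.fullmatch(r"\s*(<[^>]+>\s*)*", line) is not None

def split_into_sections(lines):
    """Split a document (given as a list of lines) into sections.

    A new section starts at a T2 or T3 tag that follows a blank line
    or a line containing only markup, per the rule described above.
    """
    sections = [[]]
    prev_was_break = True              # treat the start of the document as a break
    for line in lines:
        starts_with_tag = line.lstrip().startswith(SECTION_TAGS)
        if prev_was_break and starts_with_tag and sections[-1]:
            sections.append([])        # open a new section at this tag
        sections[-1].append(line)
        prev_was_break = is_markup_only(line)
    return ["\n".join(sec) for sec in sections]
```

Each resulting section would then be assigned a type, such as (purpose) or (summary), according to its opening tag.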
Once these selections had been made, only 19 of queries 51-100 had a relevant document in the collection. Each of the sections of the documents that had been judged relevant was then judged for relevance against these queries, so that finer-grained retrieval experiments were possible.

One difficulty that arose was that quite a few documents that had been judged relevant appeared to have no relevant sections: there were relevant key terms in the documents, but the documents themselves did not appear to address the information requirement. There were 145 such (query, document) pairs. To be consistent, we took these documents to be irrelevant. After these alterations, only 14 queries had a relevant section, and there was an average of 23 relevant sections per query.

3.2 Structured ranking

We carried out a set of experiments on ranking documents using the retrieval of sections. We first compared simple ranking of documents against ranking sections to find relevant documents. Next, a set of formulae was devised that attempted to exploit the fact that one document has several sections that might be more or less highly ranked. These took into consideration the rank of each section, the number of ranked sections, and the number of sections in the document. Experiment 3 describes one of the more successful formulae. Further trials were then performed using the type of the section. First, a set of experiments was run to determine which types were better predictors of relevance. These results were then used to devise a measure that assigned a weight to each type. Finally, we tried to combine these