the face of non-exhaustive relevance judgements. When precision rates are around 30%, and a further (in the L = 1,000 case) 30% of documents are unjudged, there can be no significance whatsoever attached to the difference between even 30% precision and 40% precision. Indeed, assuming that 27.1% of the unjudged documents are relevant for the "L = 1,000; Doc" combination gives a final precision of 0.411; the corresponding number for the "All; Doc" pairing is only 0.373. Thus, the precision figures of Table 1 are sufficiently imprecise that no conclusion can be drawn about the appropriate value of L, or about the merits of document vs. paged retrieval. There is clearly scope for research into other methodologies for comparing retrieval mechanisms.

3 Structured documents

Many of the documents in the TREC collection are very large and have explicit structure, and it may be possible to use this structure, rather than the statistically based pagination methods described above, to break documents into parts. In particular, many documents can be broken up into a set of sections, each section having a type.

There has been relatively little work done on retrieving or ranking partial documents. However, Salton et al. [12] have demonstrated that document structure can be valuable. Sometimes this structure is explicitly available [2], and sometimes it has to be discovered [5], but knowledge of this structure has been shown to help determine the relevance of sub-documents. In this part of the work we used a small database to investigate whether retrieval of sections helped document retrieval, and whether retrieval of documents helped section retrieval. By way of a benchmark, the paged retrieval techniques described earlier were applied to the same database.

3.1 The database

Since we needed information about the relevance of sections to queries, it was not possible to use the full TREC database. Instead, we used a database consisting of 4,000 documents extracted from the Federal Register collection. These documents were selected as the 2,000 largest documents that were relevant to at least one of topics 51-100, the topics provided for the first TREC experiment. Another 2,000 documents were randomly selected from the Federal Register collection to provide both smaller documents and non-relevant documents. The average number of words in these documents was 3,260.

These documents were then split into sections based on their internal markup. The documents had a number of tags inserted that defined an internal structure, and it appeared that only the T2 and T3 tags could be reliably used to indicate a new internal fragment. Section breaks were therefore defined to be a blank line, or a line containing only markup, followed by a T2 or T3 tag. This led to a database of 32,737 sections. Each of these sections had a type based on its tag. The types were (purpose), (abstract), (start), (summary), (title), (supplementary), and a general category, (misc), that covered all remaining sections.
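As a rough illustration of this splitting rule, the sketch below splits a document into sections wherever a T2 or T3 tag follows a blank or markup-only line. It assumes SGML-style <T2> and <T3> markers; the exact Federal Register markup and the splitter actually used in the experiments are not reproduced here.

```python
import re

# Assumed tag syntax; the actual Federal Register markup may differ.
SECTION_TAGS = ("<T2>", "<T3>")

def is_markup_only(line):
    """True if the line is blank or contains nothing but SGML-style tags."""
    return re.fullmatch(r"\s*(<[^>]+>\s*)*", line) is not None

def split_into_sections(lines):
    """Split a document (given as a list of lines) into sections.

    A new section starts at a T2 or T3 tag that follows a blank line
    or a line containing only markup, per the rule described above.
    """
    sections = [[]]
    prev_was_break = True              # treat the start of the document as a break
    for line in lines:
        starts_with_tag = line.lstrip().startswith(SECTION_TAGS)
        if prev_was_break and starts_with_tag and sections[-1]:
            sections.append([])        # open a new section at this tag
        sections[-1].append(line)
        prev_was_break = is_markup_only(line)
    return ["\n".join(sec) for sec in sections]
```

Each resulting section would then be assigned a type, such as (purpose) or (summary), according to its opening tag.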
Once these selections had been made, only 19 of queries 51-100 had a relevant document in the collection. Each of the sections of the documents that had been judged relevant was then judged for relevance against these queries, so that finer-grained retrieval experiments were possible.

One difficulty that arose was that quite a few documents that had been judged relevant appeared to have no relevant sections: there were relevant key terms in the documents, but the documents themselves did not appear to address the information requirement. There were 145 such (query, document) pairs. To be consistent, we took these documents to be irrelevant. After these alterations, only 14 queries had a relevant section, and there was an average of 23 relevant sections per query.

3.2 Structured ranking

We carried out a set of experiments on ranking documents using the retrieval of sections. We first compared simple ranking of documents against ranking sections to find relevant documents. Next, a set of formulae was devised that attempted to exploit the fact that one document has several sections that might be more or less highly ranked. These took into consideration the rank of each section, the number of ranked sections, and the number of sections in the document. Experiment 3 describes one of the more successful formulae. Further trials were then performed using the type of the section. First, a set of experiments was run to determine which types were better predictors of relevance. These results were then used to devise a measure that assigned a weight to each type. Finally, we tried to combine these