
4 TextTiles: Enhancing Retrieval through Automatic Sub-Topic Identification

Full-length documents have only recently become available online in large quantities, although technical abstracts and short newswire texts have been accessible for many years [15]. For this reason, most information retrieval methods are better suited to accessing abstracts than full-length documents. Part of the repository research was an exploration of new approaches to information retrieval particularly suited to full-length texts, such as those expected in the Sequoia 2000 database.

A problem with applying traditional information retrieval methods to full-length text documents is that the structure of full-length documents is quite different from that of abstracts. One way to view an expository text is as a sequence of subtopics set against a ``backdrop'' of one or two main topics. A long text comprises many different subtopics, which may be related to one another and to the backdrop in many different ways. The main topics of a text are discussed in its abstract, if one exists, but its subtopics usually are not mentioned. Therefore, instead of querying against the entire content of a document, a user should be able to issue a query about a coherent subpart, or subtopic, of a full-length document, and that subtopic should be specifiable with respect to the document's main topic(s).

Consider a Discover magazine article about the Magellan space probe's exploration of Venus. A reader divided this 23-paragraph article into a sequence of labeled segments, each covering a contiguous run of paragraphs.

Assume that the topic of ``volcanic activity'', or perhaps ``geological activity'', is of interest to a user. Crucial to a system's decision to retrieve this document is the knowledge that it contains a dense discussion of volcanic activity, rather than a passing reference. Since volcanism is not one of the text's two main topics, we should not expect references to it to dominate the document's term-frequency statistics. On the other hand, we do not want to select a document merely because it contains a few scattered references to the target terms.

The goal should be to determine whether or not a relevant discussion of a concept or topic appears. A simple approach to distinguishing a true discussion from a passing reference is to measure the ``locality'' of the references. In the computer science operating systems literature, ``locality'' refers to the fact that memory access patterns tend, over time, to concentrate in localized clusters rather than being distributed evenly throughout memory. Similarly, in full-length texts, a set of references to a particular concept occurring in close proximity to one another is a good indicator of topicality. For example, the term volcanism occurs five times in the Magellan article; the first four instances occur in four adjacent paragraphs, along with accompanying discussion. In contrast, the term scientists, which does not name a valid subtopic, occurs thirteen times, distributed fairly evenly throughout. By its very nature, a subtopic is not discussed throughout an entire text. Equally, true subtopics are not indicated by passing references alone. The term belly dancer occurs only once, and its related terms are confined to the single sentence in which it appears; as its usage is only a passing reference, ``belly dancing'' is not a true subtopic of this text.
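As an illustration, the following Python sketch implements a locality test of this kind. The window size and thresholds are illustrative assumptions of this sketch, not values from the paper:

    import re

    def occurrences_per_paragraph(paragraphs, term):
        # Case-insensitive whole-word occurrence counts, one per paragraph.
        pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
        return [len(pattern.findall(p)) for p in paragraphs]

    def locality(counts, window=4):
        # Fraction of all occurrences falling in the densest run of
        # `window` adjacent paragraphs; 1.0 means fully clustered.
        total = sum(counts)
        if total == 0:
            return 0.0
        best = max(sum(counts[i:i + window]) for i in range(len(counts)))
        return best / total

    def is_subtopic(counts, window=4, min_total=3, min_locality=0.6):
        # A true subtopic needs enough references (ruling out "belly
        # dancer") and a dense cluster (ruling out "scientists").
        return sum(counts) >= min_total and locality(counts, window) >= min_locality

On the Magellan article, volcanism (five occurrences, four in adjacent paragraphs) passes both tests, scientists (thirteen occurrences spread across the text) fails the locality test, and belly dancer (one occurrence) fails the count test.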

Our solution to the problem of retaining valid subtopical discussions while not being fooled by passing references is to use locality information to partition documents according to their subtopical structure. The capacity of this approach to improve a standard information retrieval task has been verified by experiments using full-text test collections [21,22].

4.1 How to Find Subtopic Structure

One way to approximate subtopic structure is to break the document into paragraphs or, for very long documents, into sections. In both cases this entails using the orthographic marking supplied by the author to determine topic boundaries.

Another way to approximate local structure in long documents is to divide them into even-sized pieces, without regard for any boundaries. We are interested, however, in exploring the performance of motivated segmentation, i.e., segmentation that reflects the text's true underlying subtopic structure, which often spans paragraph boundaries.
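For reference, here is a minimal sketch of the two uninformed baselines, assuming paragraphs are separated by blank lines and sentences have already been tokenized (both assumptions of this sketch):

    def split_paragraphs(text):
        # Orthographic segmentation: rely on the author's blank-line breaks.
        return [p.strip() for p in text.split('\n\n') if p.strip()]

    def even_chunks(sentences, size=20):
        # Uninformed segmentation: fixed-size runs of sentences that
        # ignore all author-supplied boundaries.
        return [sentences[i:i + size] for i in range(0, len(sentences), size)]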

Toward this end, we have developed TextTiling, a method for partitioning full-length text documents into coherent multi-paragraph units [20,21,23]. TextTiling approximates the subtopic structure of a document by using patterns of lexical connectivity to find coherent subdiscussions. The layout of the `tiles' is meant to reflect the pattern of subtopics contained in an expository text. The approach uses quantitative lexical analyses to determine the extent of the tiles and to classify them with respect to a general knowledge base. The tiles have been found to correspond well to human judgements of the major subtopic boundaries of science magazine articles.

The algorithm is a two-step process: first, all pairs of adjacent blocks of text (where blocks are usually 3-5 sentences long) are compared and assigned a similarity value; then the resulting sequence of similarity values, after being graphed and smoothed, is examined for peaks and valleys. High similarity values, implying that the adjacent blocks cohere well, tend to form peaks, whereas low similarity values, indicating a potential boundary between tiles, create valleys. Figure 13 shows such a graph for the Magellan article discussed above. The vertical lines indicate where human judges thought the topic boundaries should be placed.

 
Figure 13:   Results of TextTiling a 77-sentence popular science article. Vertical lines indicate actual topic boundaries as determined by human judges, and the graph indicates computed similarity of adjacent blocks of text. Peaks indicate coherency, and valleys indicate potential breaks between tiles.

The one adjustable parameter is the size of the block used for comparison. This value, labeled k, varies slightly from text to text; as a heuristic, it is assigned the average paragraph length (in sentences), although the block size that best matches the human judgement data is sometimes one sentence larger or smaller. Actual paragraphs are not used because their lengths can be highly irregular, leading to unbalanced comparisons.
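A sketch of this heuristic; rounding the average to the nearest whole sentence is an assumption of the sketch:

    def block_size(paragraph_lengths):
        # paragraph_lengths: number of sentences in each paragraph.
        # k is the average paragraph length, in sentences.
        k = round(sum(paragraph_lengths) / len(paragraph_lengths))
        return max(1, k)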

Similarity is measured by putting a twist on the tf.idf measurement [12]. In standard tf.idf, terms that are frequent in an individual document but relatively infrequent throughout the corpus are considered good distinguishers of the contents of that document. In TextTiling, each block of k sentences is treated as a unit unto itself, and the frequency of a term within each block is compared to its frequency in the entire document. This helps bring out the distinction between the local and global extent of terms: if a term is discussed frequently but within a localized cluster (thus indicating a cohesive passage), it is weighted more heavily than if it appears frequently but is scattered evenly throughout the entire document, or appears infrequently within one block. Thus if adjacent blocks share many terms, and those shared terms are weighted heavily, there is strong evidence that the adjacent blocks cohere with one another.
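One plausible reading of this weighting, sketched below, treats each block as a mini-document and the set of blocks as the corpus. The exact formula is not given in this section, so the log-based weighting is an assumption:

    import math
    from collections import Counter

    def block_weights(blocks):
        # blocks: list of token lists, each covering k sentences.
        n = len(blocks)
        df = Counter()                  # how many blocks contain each term
        for block in blocks:
            df.update(set(block))
        weights = []
        for block in blocks:
            tf = Counter(block)
            # Terms frequent in this block but rare across blocks get
            # high weights; terms present in every block get weight 0.
            weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
        return weights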

Similarity between blocks is calculated by a cosine measure: given two text blocks b1 and b2,

\[ \mathrm{sim}(b_1, b_2) \;=\; \frac{\sum_{t} w_{t,b_1}\, w_{t,b_2}}{\sqrt{\sum_{t} w_{t,b_1}^{2} \; \sum_{t} w_{t,b_2}^{2}}} \]

where t ranges over all the terms in the document and w_{t,b_1} is the tf.idf weight assigned to term t in block b1. Thus if the similarity score between two blocks is high, then not only do the blocks have terms in common, but the terms they have in common are relatively rare with respect to the rest of the document. The evidence in the reverse direction is not as conclusive: a low similarity measure between adjacent blocks does not necessarily mean they do not cohere; however, in practice this negative evidence is often justified.
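The formula above translates directly into code; a sketch over the sparse weight dictionaries produced by the previous step:

    import math

    def cosine(w1, w2):
        # w1, w2: term -> tf.idf weight for two adjacent blocks.
        num = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
        den = math.sqrt(sum(v * v for v in w1.values()) *
                        sum(v * v for v in w2.values()))
        return num / den if den else 0.0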

The graph is then smoothed by taking a discrete convolution of the similarity function with a smoothing window.
The result is smoothed further with a simple median smoothing algorithm [24], with a window of size three, to eliminate small local minima. Tile boundaries are determined by locating the lowermost portions of valleys in the resulting plot. The actual values of the similarity measures are not taken into account; only the relative differences matter.
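A minimal sketch of the smoothing and boundary-finding steps, using a simple moving average in place of the convolution window (the choice of kernel is an assumption here) followed by median smoothing with a window of three:

    import statistics

    def moving_average(values, window=3):
        # Stand-in for the convolution step: average each point with
        # its neighbors to damp single-point noise.
        half = window // 2
        out = []
        for i in range(len(values)):
            hood = values[max(0, i - half):i + half + 1]
            out.append(sum(hood) / len(hood))
        return out

    def median_smooth(values, window=3):
        # Median smoothing [24] with a window of size three.
        half = window // 2
        return [statistics.median(values[max(0, i - half):i + half + 1])
                for i in range(len(values))]

    def tile_boundaries(values):
        # Boundaries sit at the bottoms of valleys; absolute values are
        # irrelevant, only dips relative to both neighbors count.
        return [i for i in range(1, len(values) - 1)
                if values[i] < values[i - 1] and values[i] < values[i + 1]]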

Retrieval processing should reflect the assumption that full-length text is meaningfully different in structure from abstracts and short articles. We have conducted retrieval experiments demonstrating that taking text structure into account can produce better results than using full-length documents in the standard way [20,21,23]. Working within this paradigm, we have developed an approach to vector-space retrieval that outperforms the standard alternatives.

In this approach, a query is matched against motivated segments, and the scores from the top segments for each document are summed. The highest resulting sums indicate which documents should be retrieved. In our test set, this method produced higher precision and recall than retrieving against entire documents or against segments or paragraphs alone [21]. Although the vector-space model of retrieval was used for these experiments, probabilistic models, such as that used in Lassen, are equally applicable, and the method should provide a similar improvement in retrieval performance.
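A sketch of this scoring scheme; the number of top segments summed per document is an illustrative parameter, not one fixed by the experiments:

    def rank_documents(segment_scores, top_n=2):
        # segment_scores: doc_id -> list of query/segment similarity scores.
        sums = {doc: sum(sorted(scores, reverse=True)[:top_n])
                for doc, scores in segment_scores.items()}
        # Highest summed score first; these are the documents to retrieve.
        return sorted(sums.items(), key=lambda kv: kv[1], reverse=True)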

We believe that recognizing the structure of full-length text for the purposes of information retrieval is very important and will produce considerable improvement in retrieval effectiveness over most existing similarity-based techniques.


