Document Clustering in Concept Space: The NIST Information Retrieval Visualization Engine (NIRVE)

John Cugini, cuz@nist.gov
Sharon Laskowski, sharon.laskowski@nist.gov
Building 225, Room A-216
National Institute of Standards and Technology (NIST)
Gaithersburg, MD 20899

Christine Piatko, Christine.Piatko@jhuapl.edu
Johns Hopkins University / Applied Physics Laboratory (JHU/APL)
Johns Hopkins Road
Laurel, MD 20723-6099


Contribution of the National Institute of Standards and Technology. Not subject to copyright.

Abstract

The NIST Information Retrieval Visualization Engine (NIRVE) is a 3-dimensional interface designed to enhance the information retrieval process by providing an overview of a set of text documents as well as access to the details of the individual documents through seamless user navigation and manipulation.

Overview and Goals

Ever larger collections of documents are becoming accessible to casual users via powerful search and retrieval algorithms. However, a user's query to a search engine can often result in hundreds of potentially relevant documents. We believe that interactive 3D visualization techniques, used correctly, can be a powerful medium in which large amounts of information can be comprehensibly presented. The goal is to give users 1) a seamless view of text document result sets, at both a general and detailed level and 2) a powerful set of operations through which the user can organize, filter, and inspect groups of documents.

Interactivity and Empowerment

Our working assumption is that most of the fully-automated selection that can be done already has been done by the front-end search engine. Typically, this consists of keyword matching - useful as far as it goes, but semantically, rather shallow. Absent some powerful natural language processing, the human user is the only agent in the human-machine system capable of deep understanding of a document's relevance to his/her intentions. Therefore, the task for the visualization back-end is not to attempt further automatic processing, but rather to empower the user to browse, re-categorize and filter the documents at hand.

NIRVE Design

Input Content

NIRVE assumes only that a search engine will deliver to it a set of documents to be displayed and the set of keywords in the query that generated that document set. The information available for each document includes:
  1. A score, computed by the search engine, which is interpreted as an estimate of how well the document matches the query.
  2. The sequential rank of the document within the set, as derived from the score.
  3. Document title.
  4. Document URL (this allows viewing of the document text via a web browser).
  5. Document length (in any consistent units).
  6. Number of occurrences of each keyword in the document.
Note that the title and keyword count are the only clues to semantic content. To the extent that additional fields become systematically available (e.g. author, date, document type) NIRVE could be extended to represent these as well.

Screen Design

NIRVE displays three windows to the user, as illustrated in the following figures.

The NIRVE Control window is a control menu from which the user can select among the various operations discussed below. Although any entry in the menu can be selected at any time, the order of entries is designed to match approximately a typical search strategy: basic organizational operations are at the top, then viewing, then style details, and finally summarization and quitting the session.

The Concept Control window is used to specify the mapping of query keywords into more comprehensive concepts, and also to assign a "weight" or importance to each of the resulting concepts.

The Document Space window contains a symbolic view of a set of documents. The primary information represented for each document is its concept profile, but other properties may also be displayed (see "Iconic Representation"). The documents are arranged in groups called clusters, based on similarity of their concept profiles. Each cluster has an icon that displays the average profile for the documents therein. There are a number of operations available in this window, as described under "View modes".

Keyword to Concept Mapping

The purpose of the set of keywords used as a query is to gather up all the relevant documents. To this end, it makes sense to add in synonyms and redundant phrases. But for the purposes of classifying the result set, extra keywords simply generate superfluous dimensions in the resulting space. Therefore, we allow the user to map the set of keywords into a presumably smaller set of concepts. In the current version, the concept names are a subset of the keyword names; this is slightly restrictive, but it saves typing. E.g. "twister" and "tornado" might both be mapped into "tornado". Note that even if a more advanced search engine provides some automatic synonym support and query expansion, this still would not address the deeper concept mapping a user might require.

The user is presented with a square matrix in the Concept Control window (see figure), initialized to the identity mapping. By clicking on the cells of the matrix, he/she may specify any desired mapping, i.e. any subset of keywords may be mapped into each concept. Typically, the mapping will be such that every keyword is mapped into exactly one concept, but this is not mandatory.

Clustering

An interesting design issue is how to assign a document its position in concept space, given its length, the number of keyword occurrences, and the keyword to concept mapping. There is probably no ideal metric; we compute the strength of a keyword for a document based on the normalized (between 0 and 1) square root of the ratio of keyword occurrences to document length. The concept strength is then computed based on the strengths of the constituent keywords. We refer to the set of concept strengths computed for each document as its concept profile. We treat this profile as the main source of semantic information about the document, other than the text itself. The profile is interpreted as the location of the document in n-dimensional concept space, where n is the number of active concepts, i.e. those into which at least one keyword has been mapped.

When computing the individual concept coordinates, we have found it to be a useful heuristic to substitute a negative value (such as -0.5) for zero; this serves to emphasize the semantic difference between a document with a one or a few occurrences of a given keyword and a document with none at all.

The user can assign different weights to the concepts in order to express his/her judgments about which are most important. This is done through a set of sliders in the Concept Control window (see figure). This window also associates each named concept with a unique color. Concept weights are used for two purposes: to affect the circular sequence used in the document space window (see below) and to calculate a scalar concept score for each document.

Given the position of each document in concept space, we define the distance between pairs of documents according to the usual Euclidean metric. That space may be uniform in each dimension, or scaled according to the weight assigned by the concept weight sliders. For the scaled metric, a higher weight emphasizes differences among documents with different values for that concept.

We then wish to determine a compact circular order for the documents (note that this is simply the multiscaling problem of reducing N dimensions to one), on the premise that documents with similar concept profiles are likely to be similar in meaning and therefore should be kept geometrically close to each other. We are currently using a simple nearest neighbor heuristic to find a reasonably short path.

The ordering allows sequential scanning of the documents by the user, thus making it easier to be sure that no documents have been overlooked. It also supports a simple algorithm to cluster documents into semantically related groups.

Within the circular sequence of documents, the gaps between adjacent documents vary in size. We define a cluster as a maximal subsequence of documents within which no gap exceeds a given threshhold. By setting the gap threshhold higher or lower, NIRVE generates fewer or more clusters, respectively. Thus, the user can dynamically specify finer or coarser cluster granularity.

Iconic representation

The resulting clusters and constituent documents are displayed as icons. The basic information displayed for each icon is a graphical representation of its concept profile. This is displayed as a colored histogram, one bar per concept (see figure for Document Space). The correspondence between color and concept is shown in the Concept Control window. The document icons are flat (so that they can be packed together and visually compared) and so their histograms are 2D. The cluster icons have 3D concept histograms so that they are more readily discernible from many viewpoints.

The clusters of documents are arranged in document space in rays or spoke-like patterns around a circle. The icons sit perpendicular to the XZ plane, with a cluster icon on the outer edge of each ray. The spacing between adjacent documents within a cluster subsequence is proportional to the distance between them in concept space. Similarly, the angular distance between adjacent clusters in document space is proportional to the distance between their concept profiles in concept space.

The result is that clusters and documents which are close geometrically are also close semantically. Furthermore, since similar documents are lined up as parallel squares, they can easily be compared visually. The converse does not always hold: semantically close documents or clusters may not wind up being close geometrically; this result is inherent in the problem of representing higher dimensional entities in a lower dimensional setting. Nonetheless, semantically similar documents will usually be in the same cluster, even if separated within the cluster.

Note the distinction between concept space and document space. The former is the abstract n-dimensional space within which a document's concept profile specifies its precise location. The latter is a 3-dimensional space in which visible document icons are heuristically arranged for the user's inspection and manipulation.

Document icons may be decorated, at the user's option, with small glyphs for such attributes as document length, document score assigned by the search engine, and document score based on concept weights. This latter concept score is basically a sum of products of the document's concept coordinates times the user-specified weight for the corresponding concept.

View modes

The Document Space display can be manipulated in several ways:

In spin mode, the display rotates around its natural axis, allowing the user to see the general nature and location of the clusters. It is also useful for scanning through all the clusters to see which appear promising or not. As we shall see, a cluster judged to be irrelevant can be deleted from the display.

In move mode, the mouse can be used to rotate and translate the display to give any desired view.

In pick mode, the mouse is used to operate on documents or clusters. Simple cursor motion causes the indicated icon to be highlighted and associated text to be displayed: the title for document icons, and a profile summary for cluster icons. Mouse-button 1 displays the details of the selected icon: the full text of a selected document or a generated text summary of a selected cluster is displayed via Netscape. Mouse-button 2 is used to mark the icon with a user-specified relevance status, as described in the next section. Mouse-button 3 causes the display to be rotated and shifted so that the selected icon is brought front and center for closer inspection.

Display Filtering

In pick mode, mouse-button 2 assigns a user-specified relevance status to a selected document or cluster, cycling through "good", "bad", and "unsure". When a cluster is marked, all its documents inherit the value. The current value is denoted by a small flag attached to the upper right of the icon: red for bad, yellow for unsure, green for good. The user can then designate any subset of documents to be displayed or suppressed, based on these values (e.g. display only "good" or "unsure" documents).

Web Summary

Finally, the user can generate a textual web page summary (example below) of the contents of the entire Document Space window. The page is organized using the same clusters and document order as that currently on display. The purpose is twofold: for the user it provides a convenient archiving mechanism; for researchers it provides a baseline for separating out the effect on users' performance of the clustering per se and the clustering plus 3D graphical presentation.

Implementation

NIRVE is currently implemented for SGI platforms using Xlib, OpenGL and Tcl/Tk. Tcl/Tk generates and manages the NIRVE Control window. The other windows are managed by a second process, using OpenGL to generate output and Xlib to handle input. Both processes are essentially X event handlers, and they communicate by sending X events back and forth. There are no known dependencies on SGI-specific features.

Conclusions

We believe that NIRVE will prove to be a useful vehicle for studying the effects of 3D visualization for information retrieval. The code can easily be instrumented to record patterns of usage. Furthermore, we plan to compare user performance among various degraded versions of NIRVE in order to separate out the effects of its various features, such as the use of keyword consolidation into concepts and 3D vs. 2D.