Document Clustering in Concept Space:
The NIST Information Retrieval Visualization Engine (NIRVE)
Christine Piatko, Christine.Piatko@jhuapl.edu
Johns Hopkins University / Applied Physics Laboratory (JHU/APL)
Johns Hopkins Road
Laurel, MD 20723-6099
Contribution of the National Institute of Standards and Technology.
Not subject to copyright.
Abstract
The NIST Information Retrieval Visualization Engine (NIRVE) is a
3-dimensional interface designed to enhance the information retrieval
process by providing an overview of a set of text documents as well as
access to the details of the individual documents through seamless
user navigation and manipulation.
Overview and Goals
Ever larger collections of documents are becoming accessible to casual
users via powerful search and retrieval algorithms. However, a user's
query to a search engine can often result in hundreds of potentially
relevant documents. We believe that interactive 3D visualization
techniques, used correctly, can be a powerful medium in which large
amounts of information can be comprehensibly presented.
The goal is to give users 1) a seamless view of text document result
sets, at both a general and detailed level and 2) a powerful
set of operations through which the user can organize, filter,
and inspect groups of documents.
Interactivity and Empowerment
Our working assumption is that most of the
fully-automated selection that can be done already has been done by
the front-end search engine. Typically, this consists of keyword
matching - useful as far as it goes, but semantically, rather shallow.
Absent some powerful natural language processing, the human user is
the only agent in the human-machine system capable of deep
understanding of a document's relevance to his/her intentions.
Therefore, the task for the visualization back-end is not to attempt
further automatic processing, but rather to empower the user to
browse, re-categorize and filter the documents at hand.
NIRVE Design
Input Content
NIRVE assumes only that a search engine will deliver to it a set of
documents to be displayed and the set of keywords in the query that
generated that document set. The information available for each
document includes:
-
A score, computed by the search engine, which is interpreted
as an estimate of how well the document matches the query.
-
The sequential rank of the document within the set, as derived from the
score.
-
Document title.
-
Document URL (this allows viewing of the document text via a web browser).
-
Document length (in any consistent units).
-
Number of occurrences of each keyword in the document.
Note that the title and keyword count are the only clues to semantic
content. To the extent that additional fields become systematically
available (e.g. author, date, document type) NIRVE could be extended
to represent these as well.
Screen Design
NIRVE displays three windows to the user, as illustrated in the
following figures.
The NIRVE Control window
is a control menu from which the user can select among the
various operations discussed below. Although any entry in the menu
can be selected at any time, the order of entries is designed
to match approximately a typical search strategy: basic organizational
operations are at the top, then viewing, then style details, and
finally summarization and quitting the session.
The Concept Control window is used to specify the
mapping of query keywords into more comprehensive concepts, and also
to assign a "weight" or importance to each of the resulting concepts.
The Document Space window contains a symbolic view of a set
of documents. The primary information represented for each document is
its concept profile,
but other properties may also be displayed (see
"Iconic Representation"). The documents are arranged in groups called
clusters, based on similarity of their concept profiles. Each cluster
has an icon that displays the average profile for the documents
therein. There are a number of operations available in this window,
as described under "View modes".
Keyword to Concept Mapping
The purpose of the set of keywords used as a query is to gather up all
the relevant documents. To this end, it makes sense to add in
synonyms and redundant phrases. But for the purposes of classifying
the result set, extra keywords simply generate superfluous
dimensions in the resulting space. Therefore, we allow the user to map
the set of keywords into a presumably smaller set of concepts. In the
current version, the concept names are a subset of the keyword names;
this is slightly restrictive, but it saves typing. E.g. "twister"
and "tornado" might both be mapped into "tornado". Note that even if a
more advanced search engine provides some automatic synonym
support and query expansion, this still would not address
the deeper concept mapping a user might require.
The user is presented with a square matrix in the Concept
Control window (see figure), initialized to the identity mapping.
By clicking on the cells of the matrix, he/she may specify any desired
mapping, i.e. any subset of keywords may be mapped into each concept.
Typically, the mapping will be such that every keyword is mapped into
exactly one concept, but this is not mandatory.
Clustering
An interesting design issue is how to assign a document its position
in concept space, given its length, the number of keyword occurrences,
and the keyword to concept mapping. There is probably no ideal
metric; we compute the strength of a keyword for a document based on
the normalized (between 0 and 1) square root of the ratio of keyword
occurrences to document length. The concept strength is then computed
based on the strengths of the constituent keywords.
We refer to the set of concept strengths
computed for each document as its concept profile. We treat
this profile as the main source of semantic information about the
document, other than the text itself. The profile is interpreted as
the location of the document in n-dimensional concept
space, where n is the number of active concepts,
i.e. those into which at least one keyword has been mapped.
When computing the individual concept coordinates, we have found
it to be a useful heuristic to substitute a negative value (such as
-0.5) for zero; this serves to emphasize the semantic difference
between a document with a one or a few occurrences of a given keyword
and a document with none at all.
The user can assign different weights to the concepts in order to
express his/her judgments about which are most important. This is
done through a set of sliders in the Concept Control window
(see figure).
This window also associates each named concept with a unique color.
Concept weights are used for two purposes: to affect the circular
sequence used in the document space window (see below) and to
calculate a scalar concept score for each document.
Given the position of each document in concept space, we define the
distance between pairs of documents according to the usual Euclidean
metric. That space may be uniform in each dimension, or scaled
according to the weight assigned by the concept weight sliders. For
the scaled metric, a higher weight emphasizes differences among
documents with different values for that concept.
We then wish to determine a compact circular order for the documents
(note that this is simply the multiscaling problem of reducing N
dimensions to one), on the premise that documents with similar concept
profiles are likely to be similar in meaning and therefore should be
kept geometrically close to each other. We are currently using a
simple nearest neighbor heuristic to find a reasonably short path.
The ordering allows sequential scanning of the documents by the user,
thus making it easier to be sure that no documents have been
overlooked. It also supports a simple algorithm to cluster documents
into semantically related groups.
Within the circular sequence of documents, the gaps between adjacent
documents vary in size. We define a cluster as a maximal
subsequence of documents within which no gap exceeds a given
threshhold. By setting the gap threshhold higher or lower, NIRVE
generates fewer or more clusters, respectively. Thus, the user can
dynamically specify finer or coarser cluster granularity.
Iconic representation
The resulting clusters and constituent documents are displayed as
icons. The basic information displayed for each icon is a graphical
representation of its concept profile. This is displayed as a colored
histogram, one bar per concept (see figure for Document
Space). The correspondence between color and
concept is shown in the Concept Control window. The
document icons are flat (so that they can be packed together and
visually compared) and so their histograms are 2D. The cluster icons
have 3D concept histograms so that they are more readily discernible
from many viewpoints.
The clusters of documents are arranged in document space
in rays or spoke-like patterns around a circle. The icons sit
perpendicular to the XZ plane, with a cluster icon on the outer edge
of each ray. The spacing between adjacent documents within a cluster
subsequence is proportional to the distance between them in concept
space. Similarly, the angular distance between adjacent clusters
in document space is proportional to the distance between their
concept profiles in concept space.
The result is that clusters and documents which are close
geometrically are also close semantically. Furthermore, since
similar documents are lined up as parallel squares, they can
easily be compared visually. The converse does not
always hold: semantically close documents or clusters may not wind up
being close geometrically; this result is inherent in the problem of
representing higher dimensional entities in a lower dimensional
setting. Nonetheless, semantically similar documents will usually be
in the same cluster, even if separated within the cluster.
Note the distinction between concept space and document space. The
former is the abstract n-dimensional space within which a document's
concept profile specifies its precise location. The latter is a
3-dimensional space in which visible document icons are heuristically
arranged for the user's inspection and manipulation.
Document icons may be decorated, at the user's option, with small
glyphs for such attributes as document length, document score assigned
by the search engine, and document score based on concept weights.
This latter concept score is basically a sum of products of the
document's concept coordinates times the user-specified weight for the
corresponding concept.
View modes
The Document Space display can be manipulated in several
ways:
In spin mode, the display rotates around its natural axis, allowing
the user to see the general nature and location of the clusters. It
is also useful for scanning through all the clusters to see which
appear promising or not. As we shall see, a cluster judged to be
irrelevant can be deleted from the display.
In move mode, the mouse can be used to rotate and translate
the display to give any desired view.
In pick mode, the mouse is used to operate on documents or clusters.
Simple cursor motion causes the indicated icon to be highlighted
and associated text to be displayed: the title for document
icons, and a profile summary for cluster icons.
Mouse-button 1 displays the details of the selected icon: the full
text of a selected document or a generated text summary of a selected
cluster is displayed via Netscape. Mouse-button 2 is used to mark the
icon with a user-specified relevance status, as described in the next
section. Mouse-button 3 causes the display to be rotated and shifted
so that the selected icon is brought front and center for closer
inspection.
Display Filtering
In pick mode, mouse-button 2 assigns a user-specified relevance status
to a selected document or cluster, cycling through "good", "bad", and
"unsure". When a cluster is marked, all its documents inherit the
value. The current value is denoted by a small flag attached to the
upper right of the icon: red for bad, yellow for unsure, green for
good. The user can then designate any subset of documents to be
displayed or suppressed, based on these values (e.g. display only
"good" or "unsure" documents).
Web Summary
Finally, the user can generate a textual web page summary
(example below) of the
contents of the entire Document Space window. The page is organized
using the same clusters and document order as that currently on
display. The purpose is twofold: for the user it provides a
convenient archiving mechanism; for researchers it provides a baseline
for separating out the effect on users' performance of the clustering
per se and the clustering plus 3D graphical presentation.
Implementation
NIRVE is currently implemented for SGI platforms using Xlib, OpenGL
and Tcl/Tk. Tcl/Tk generates and manages the NIRVE Control
window. The other windows are managed by a second process, using
OpenGL to generate output and Xlib to handle input. Both processes
are essentially X event handlers, and they communicate by sending X
events back and forth. There are no known dependencies on
SGI-specific features.
Conclusions
We believe that NIRVE will prove to be a useful vehicle for studying
the effects of 3D visualization for information retrieval. The code
can easily be instrumented to record patterns of usage. Furthermore,
we plan to compare user performance among various degraded versions of
NIRVE in order to separate out the effects of its various features,
such as the use of keyword consolidation into concepts and 3D vs. 2D.