Interactive 3D Visualization for Document Retrieval

John Cugini, cuz@nist.gov
Dr. Christine Piatko, cpiatko@nist.gov
Dr. Sharon Laskowski, Group Leader, sharon.laskowski@nist.gov

Visualization and Virtual Reality Group
Information Access and User Interfaces Division
Information Technology Laboratory
NIST
Gaithersburg, MD 20899
Phone: (301) 975-6033; Fax: (301) 840-1357

Contribution of the National Institute of Standards and Technology. Not subject to copyright.

Abstract

The availability of large collections of documents coupled with powerful search and retrieval algorithms provides the opportunity for people to access large sets of relevant documents in electronic form. However, often a user query can result in hundreds of potentially useful documents returned by the system. Our hypothesis is that interactive 3D graphics techniques can be used to help the user comprehend and filter such result sets. We describe some prototypes developed at NIST in pursuit of these goals and discuss associated design issues, such as icon appearance, layout within 3D space, and interaction mechanisms.

1. Problem Statement

The VVR group has been exploring the use of information visualization in conjunction with several projects in the Information Access and User Interfaces Division. In particular, we are developing an advanced visual interface to an existing NIST statistical text retrieval system called PRISE.

Statistical search engines such as PRISE efficiently retrieve a ranked list of potentially relevant documents from a large document collection based on a user query. In PRISE, the query is simply a set of terms (also called "keywords") that the user believes occur in the documents of interest. The usual display for the resulting set of retrieved documents is a linear text list of document titles. This display does not, however, constitute a full solution to the retrieval problem. Among the difficulties are:

The returned list may be very large. In PRISE the user can designate an arbitrary limit, but in general it is not unusual for search engines to return hundreds or even thousands of document titles.
The user must scan through the list, deciding which documents to ignore and which to inspect further.
Full text is available only by request per individual document.
Although the order of the list makes it more likely that useful documents are at the top, the "best" document from the user's perspective may be towards the end.
The search engine may order documents according to its own weighting of query terms; this weighting may be very different from the one that would be assigned by the user.
The user may find a "good" document within the list and wish to see other closely related entries.
There may be no way to get a global view of what the list contains, such as which query terms are associated with many retrieved documents and which with few.

We are assuming throughout that the query itself does not leave much scope for improvement: that the user has already refined the query so as to include useful terms and delete irrelevant ones. Thus, the user cannot simply generate a "better" list of documents, but rather must manipulate the existing one. Such manipulation has previously been explored from an information retrieval rather than a visualization perspective [HP96].

2. Visualization as a Solution

Our goal is to demonstrate that three dimensional visualization can be used to interactively filter the retrieval set and help the user find those documents which are most relevant more quickly than a traditional scrolled list. This seems plausible given the presumed strengths of interactive 3D visualization techniques:

Visualization helps users comprehend large quantities of data.
Visual attributes can present abstract representations of data.
Relationships among displayed entities become apparent.
Graphical techniques allow more direct intuitive interactions with the entities of interest.

How then can we represent the data contained in the result set of a document query so as to allow the user to find the relevant entities and discover patterns of interest?

Below, we describe three particular visualization paradigms we have developed to address such goals. The first paradigm, a document spiral, emphasizes the relevance ranking assigned to the documents by the search engine. The second paradigm, the three-keyword axes display, emphasizes the frequency of particular query terms in the documents. The third paradigm, nearest neighbor clustering, is used to show how documents are related to each other based on their query term frequency. Each paradigm allows the user to navigate through the display, and to dynamically change the display based on the user's original query keywords. We conclude by describing avenues for future work.

3. Display Paradigms

Some general design rules and constraints guided our approach:

For each document, PRISE supplies its title, rank, relevance score, length, and the set of keywords found within the document. We augmented PRISE to count occurrences of keywords by document.
Each document is represented by a single icon.
The main information encoded by each document icon is a graphical display of each keyword's "strength" (explained in more detail below) within that document. This reflects our judgment that keyword count is the best metric available within PRISE for the basic semantic content of the document. Each keyword is associated with a distinctive color.
Each document icon also has a yellow bar proportional to the associated document length.
Semantic properties of a document may be represented by either the appearance or location of its icon.
The user should be able to manipulate the view of the scene as a whole, e.g. by rotating and translating.
The keywords of the generating query are themselves objects of interest and can be used to manipulate the icon display.
The user should be able to select individual icons and display the associated title or text. In our models, positioning the mouse over an icon causes the title to appear at the bottom of the window, and clicking on the icon causes the full text to be displayed in a separate window.

Now we describe the specific paradigms we have implemented and discuss how well they address the problems outlined in section 1.

3.1 Document Spiral

The major organizing principle of the document spiral paradigm is that documents with higher relevance scores should be more centrally located than those with lower scores. All the icons are initially located on the XZ plane. The highest ranked document icon is placed in the center of the plane. Subsequent document icons are arranged along a spiral from the center, spaced in a manner proportional to their relative document score. A document density parameter gives the user some flexibility over how tightly the icons are packed along the spiral. Documents with exactly the same score are placed next to each other to keep them from overlapping.

Thus, the model does not stray too far from the spirit of the original list organization: the rank order of documents is quite apparent. The major insight added by the spiral is a visual representation of the distribution of relevance scores. It is easy to see at a glance if the score distribution is relatively uniform or if there are just a few documents with high scores at the center of spiral, with most of the rest the periphery.

Keyword Sliders

Keyword sliders allow the user to accentuate documents which contain keywords considered to be especially important. The user can manipulate the sliders to control the weighting factor for each keyword. The sliders are color-matched with the keyword-strength display in the icons. Documents whose keyword strengths match heavily weighted factors are emphasized by elevation of the document icon above the usual XZ plane of the spiral.

This interaction can be performed in two modes. In "OR" mode, the elevation = sum of (factor(i) * strength(i)) where i indexes the factors and strengths for each keyword. This has the effect that if a document has any keyword with a positive strength for which the user has specified a positive factor, that document will be elevated above the default plane.

In "AND" mode, we add the condition that if a document has any keyword with zero strength for which the user has specified a positive factor, that document will not be elevated, i.e. a document is elevated if and only if it has a positive strength for every positive factor.

3.2 Document Three-Keyword Axes Display

As an alternative to the direct display of the rank information, we allow the user to display the document icons in three dimensions based on the keyword strength statistics. We call this the three-keyword axes display.

Of course, a query could contain many more than three keywords. We do not, therefore, limit the user to one keyword per axis. Rather, the user may assign any subset of the keywords to each axis. This association is controlled by a separate keyword window (similar to the slider window) with a column of checkboxes for the X, Y and Z axes. Each axis is labelled with its associated keyword(s). The location of the icon on an axis is computed as the average of the constituent keyword strengths, and the position of the icons is dynamically updated as the axis choices are changed. Of course the individual icons still encode the strength for all the keywords.

Since the relevance score is not encoded in the position of the icon (as is true in the document spiral), each icon also has a pale pink bar proportional in length to the document's score.

The main advantage of this display is that it allows the user to interactively explore how the query terms relate to the retrieved documents. For example, choosing the same keyword for all three axes presents a linearly ordered list along X=Y=Z in order of keyword strength. If one particular term is very important to user, this line of documents is in rank order based on that single term's frequency. Choosing a different keyword for one of the axes then shows a plane of documents containing the two terms. Finally, choosing three keywords gives a scatterplot display that can be examined for interesting extreme points and possible clustering. By dynamically selecting terms, the user can get a better understanding of the distribution of various query terms in the documents. Interacting with the display may also help the user identify common document phrases. For example, lines of icons with the same slope indicate documents with the same proportion of keyword strengths.

An obvious disadvantage of this method is that if more than three keywords are mapped to the axes, the resulting location of the icons does not provide a unique indication of the represented keyword strengths. Moreover, documents which do not contain any of the chosen keywords are not displayed at all. Another disadvantage is that rank and relevance score must also be encoded in the document icon.

3.3 Nearest Neighbor Sequence

The nearest neighbor paradigm is a heuristic to get semantically similar documents to cluster in the same spatial region. Each document has a "position" in semantic space, represented as a vector of keyword strengths. However, since there are at most three spatial dimensions for visualization but potentially many more keywords, we cannot simply map documents directly from "keyword space" to geometric space. Instead, we try to find a linear order that keeps semantically close documents if not adjacent, then at least nearby.

As an initial approach, we used a simple nearest neighbor algorithm to order the documents: given some initial choice, each document in the sequence is the nearest (of those not already in the sequence) neighbor to its predecessor. The semantic distance between any two documents is based on each document's keyword strength vector and can be computed in two ways: simple Euclidean distance, or as the angle between the vectors.

The re-ordered document icons are then laid out in a circle on the XZ plane with each icon perpendicular to that plane (similar to photographic slides in a carousel tray). The keyword frequencies of adjacent documents can then be compared visually, by looking through one icon to the one lined up (almost) behind it. The spacing between documents is proportional to the semantic distance computed as part of the nearest neighbor algorithm. As in the document-spiral model, keyword sliders are provided to modify the elevation of the icons.

While our current "clustering layout" scheme is simple, we found that neighboring document icons spaced close together did have visually similar graphs of their keyword strengths.

4. Conclusions

A major advantage of all the visualization paradigms is that they give the user an immediate overview of the retrieval set. Hundreds of document icons can be displayed simultaneously, versus tens of document titles for a scrollable text list. In addition, the graphical display of keyword strengths makes it easy to understand at a glance the relative distribution of terms in a document.

Another advantage is that the user can manipulate the retrieval set interactively while maintaining context. This is particularly true for the keyword sliders, as described above.

One complication in arranging the document icons is that it is difficult for a user to distinguish them visually if they are placed so they overlap or are very close to each other. This might be remedied by detecting document collisions and resolving them in some way (such as the artificial spacing within the spiral) or highlighting them (using color, transparency, etc.) and allowing the user to separate these document clusters interactively.

5. Future Work

We plan to develop more effective representations for ranked lists and more effective document displays (such as TileBars [Hea95]) embedded in a three dimensional space. We also hope to explore visualization in conjunction with other information retrieval search engine features, such as relevance feedback to help a user to refine a query with additional query terms.

The nearest neighbor paradigm touched on the use of clustering and the display of document clusters. We hope to develop new paradigms for computing, displaying and interacting with document clusters.

While we have developed displays for a ranked list resulting from a single query, we have not yet built on the notion of a workspace, supporting sets of queries and boolean combinations of them. This will require developing new methods to display and manipulate multiple result sets.

Finally, there remains the very important issue of how to measure and evaluate the effectiveness of these kinds of advanced displays for information retrieval. We expect to conduct evaluations of some of these visualization paradigms during the coming year. We hope to demonstrate what instrumentation is suitable for evaluating advanced visualization tools in the context of information retrieval.

Bibliography

[Cha95] M. Chalmers. Design perspectives in visualising complex information. In S. Spaccapietra and R. Jain, editors, Proc 3rd IFIP Visual Databases Conferences,Visual Database Systems 3, pages 103--111. Chapman and Hall, 1995.
[Hea95] M. A. Hearst. TileBars: Visualization of term distribution information in full text information access. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, pages 213--220, April 1995.
[HKW] M. Hemmje, C. Kunkel, and A. Willet. LyberWorld -- a visualization user interface supporting full-text retrieval. In SIGIR 94.
[HP96] M. A. Hearst and J. O. Pederson. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In SIGIR 96, August 1996.
[PSHD96] P. Pirolli, P. Schank, M. Hearst, and C. Diehl. Scatter/gather browsing communicates the topic structure of a very large text collection. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, pages 213--220, April 1996.
[VB] A. Veerasamy and N. J. Belkin. Evaluation of a tool for visualization of information retrieval results. In SIGIR 96, August 1996.

Appendix: Implementation Notes

The PRISE Zclient interface was built with C and Tcl/Tk. We chose OpenGL and X-Windows for the 3D graphical display and relied on X for communication between the two kinds of windows.

The major software packages used to build the system are OpenGL, Xlib, and Tcl/Tk. Because PRISE currently runs only on the Sun, and because we wanted to take advantage of the performance of OpenGL on SGI platforms, we implemented the system as two processes. The control process is written in Tcl/Tk; it displays the main menu, runs on the Sun (although displayed on the SGI), and communicates with the graphics process via X events. The graphics process runs on the SGI and is a fairly conventional Xlib/OpenGL application. Xlib is used only to catch X events (i.e. essentially for input); all graphics output is done via OpenGL. X events are generated either directly by the user (via the mouse), or by the control process.

Performance is quite acceptable for at least a few hundred document icons. There is very little perceptible delay in document picking or in scene manipulation.

Our current main obstacle to a more interactive tool lies in the extra computation time required to compute the keyword strength statistics, since the PRISE search engine does not return these directly.

Main menu controls

A separate window is used to control several different modes for manipulating the document icon display. The following modes are supported:

Document picking mode: Either active, which freezes the scene, and allows the user get the title or full text of the document from the icon, or inactive.
Viewing mode: User can cause the scene to spin around its natural axis at a steady speed, pause, or control the view directly with the mouse. In the latter case, the left button controls latitude and longitude, and the other two control XYZ translation.
Spin control: Causes scene to spin faster or slower or in reverse.
Keyword logic: apply keyword elevation using "AND" or "OR" logic.
Icon style: The keyword strengths can be displayed on the icon as a bar graph ("Polyline"), a bar graph with the area filled in ("Area"), or a histogram ("Histogram").

Keyword Strength

Keyword frequency is the number of times the keyword appears in the document divided by the document length. These keyword frequencies are then normalized by dividing by the maximum frequency in the collection, leaving a value between zero and one. Finally, "keyword strength" is obtained by taking the square root of the normalized frequency, so as to enhance the visibility of small values.

Options

There are a few command-line options to control the overall appearance of the display:

Use orthographic or perspective view.
Density: how tightly should the icons be packed? This applies only to the document spiral and nearest-neighbor sequence.
Distance metric: compute distance between documents using the Euclidean metric or the angle between keyword strength vectors. This applies only to the nearest-neighbor sequence.

The PRISE system

In response to a user supplied list of query terms, the PRISE retrieval system returns a ranked list of documents according to a score estimating relevance of the document to the query. The ranked list consists of a series of brief records describing each document. A separate request is necessary to retrieve the full document text for each record.

Note in particular that PRISE does not provide the number of occurrences of each query term in the document, nor the occurrence count for any other terms (such as possible synonyms). We made some modifications to compute how many times each (stemmed) query term appeared in each document. We implemented two methods to compute these extra document statistics. One method uses a Perl script to control a batch mode interface to PRISE to compute keyword counts for small batches of documents. The other method uses a modified Zclient Tcl/Tk/C interface to interactively retrieve each document in a retrieval list and compute the keyword counts. However, we found that either method was prohibitively slow for a long list of documents. In order to achieve interactive rates, we would need to modify the search engine to return keyword statistics in the standard brief document records.