Presenting Search Results:
Design, Visualization, and Evaluation
John Cugini / cuz@nist.gov
Information Technology Laboratory
National Institute of Standards and Technology
(NIST)
Gaithersburg, MD 20899
Contribution of the National Institute of Standards and Technology.
Not subject to copyright. Reference to specific commercial products
or brands is for information purposes only; no endorsement or
recommendation by the National Institute of Standards and Technology,
explicit or implicit, is intended.
Submitted to the
Information Doors -- Where Information Search and Hypertext Link
workshop, May 30, 2000, San Antonio, Texas (held in conjunction with
the ACM Hypertext and Digital Libraries conferences).
This paper is accessible at:
http://www.itl.nist.gov/iaui/vvrg/cugini/irlib/paper-may2000.html
Abstract
A number of projects at the
National Institute of
Standards and Technology (NIST) have addressed the generation,
evaluation, and presentation of search results. This paper contains a
general characterization of the presentation problem, and an outline
of the components of an associated evaluation system. Within the
design space for this problem, we distinguish between the logical
structure imposed on the result set and the interface by which the
structured results are presented to the user. This interface includes
the operations provided for the manipulation of the set as well as its
visual presentation. Any design, no matter how intuitively appealing,
should be evaluated and the full array of issues for HCI testing then
comes into play. In particular, researchers must decide on a base
case against which to measure results, whether to use high-level
and/or low-level metrics, and which tasks are appropriate for the
evaluation.
Keywords
Search engine, search results, user interface, information retrieval,
visualization, usability evaluation.
1. Relevant NIST Experience and Projects
The major mission of the Information Access
Division [IAD] of NIST is to accelerate the
development of technologies that allow intuitive and efficient access,
manipulation, and exchange of complex information. We contribute to
those goals primarily via measurement methods and standards. Several
of our projects, described below, are particularly relevant to the
problem of designing an interface by which users can examine and
manipulate the set of documents resulting from an automatic search.
In this paper, we present a general characterization of this problem
and we propose a method for systematic evaluation of such interfaces.
One of the lessons of our experience is that no matter how much
intuitive appeal a given interface might have, without some systematic
testing, its real value remains unknown. Especially in the field of
visualization, it is all too common for technical wizardry to be
unaccompanied by any real gain in efficiency. Several IAD programs
have demonstrated the value of applying a consistent set of tests to
emerging technologies. Evaluation is a crucial link in the feedback
loop. It focuses the development effort and gives some sense of where
progress is being made.
- The annual Text REtrieval Conference [TREC] has become the premier forum for the quantitative evaluation of search
engines. Note that this is primarily a comparison of the quality of
result sets returned by various engines for a given query, not of the
mode of presentation. However, for the past few years the
interactive track [INTREC] of TREC has
experimented with methods for evaluating the user interface.
- For the past four years, the NIST Information Retrieval
Visualization Engine [NIRVE] project has explored
the potential value of 3-D visualization in helping users understand
and manipulate search results. One paper emerging from this project
describes the design of NIRVE and also presents a brief survey of
other attempts to visualize result sets [Cugi00].
Another describes an in-depth user evaluation performed on variants of
NIRVE [Sebr99].
- Two projects directly address usability testing. The I-USR [IUSR] project is developing standards for reporting
the results of user testing, so that usability data can be shared
between software developers and purchasing organizations. The WebMetrics
[WEBM] project researches and develops software
prototypes, notably including tools for logging user activity and
visualizing the resulting log files, in order to support rapid,
remote, and automated usability testing of web-based applications.
- In a consulting project for the Social Security Administration, NIST
researchers developed a system for the automatic generation of
semantic hyperlinks to allow easier navigation within a large body of
text. Furthermore, this was followed up by a user study to help
evaluate and refine the prototype [Tebb99].
- Finally, we have recently undertaken a project called IRLIB to address
a problem that has long plagued scientific communities: limited access
to older citations. The main goals of IRLIB are to determine how best
to make this material accessible and to investigate methods for remote
testing of the usability of digital libraries. During late 1998 and
early 1999 IAD built a very small-scale digital library of some older
citations of interest to the IR community, consisting of six
proceedings, one book, and one NIST monograph. These were scanned,
converted to text with OCR, and made web-accessible for searching.
2. Design Framework for Search Results
Can we do better than ranked lists? The intuition that the answer
must be "yes" has motivated several research projects in
visualization, as mentioned above [Cugi00].
We set aside the problem of producing a better result set and take the
actual result set as a given. How then should the set be organized and
presented so as to allow users to accomplish their goals?
2.1 Information Resources
Let us start by listing the information that may be available as input
to some automated process, which we will refer to as the result
interface manager (RIM).
- From each resulting document:
- Query term occurrence
At the least it would be useful for the
RIM to know which subset of query terms occurs in each document. It
would be better still if the search engine returned a count of the
number of occurrences of each query term. Since these are the terms
the user thought important enough to put in the query in the first
place, it seems reasonable to assume that they should help to
determine the way in which the result set is structured. Some search
engines, however, do not return this information.
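For concreteness, here is a minimal sketch (not drawn from any of the
systems discussed here) of how a RIM might tally query-term occurrences
when it has only the document text and the list of query terms; the
simple tokenization is an assumption of the example.

```python
import re
from collections import Counter

def query_term_counts(document_text, query_terms):
    """Count how often each query term occurs in one document.

    Returns a dict mapping each query term to its occurrence count;
    terms with a count of zero are included so the RIM can also see
    which subset of the query is missing from the document.
    """
    tokens = re.findall(r"[a-z0-9]+", document_text.lower())
    counts = Counter(tokens)
    return {term: counts.get(term.lower(), 0) for term in query_terms}

# Example: which query terms appear in this document, and how often?
doc = "A tornado, also called a twister, struck the town twice."
print(query_term_counts(doc, ["tornado", "twister", "flood"]))
# {'tornado': 1, 'twister': 1, 'flood': 0}
```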
- Full term vector
For a managed digital library, the complete
term vector (i.e. including all significant terms, not just those
matching the query) of a document gives a more complete
characterization and may be pre-computed. For an unmanaged
collection, such as the entire web, acquiring this information is more
difficult.
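When the collection is managed, the full term vector could be
pre-computed and stored with each document; the sketch below shows one
plausible representation and a cosine comparison between two such
vectors. The stop-word list and the raw-count weighting are
illustrative assumptions only.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "the", "of", "and", "in", "to", "is"}  # illustrative only

def term_vector(text):
    """Pre-computable term vector: counts of all significant terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def cosine(v1, v2):
    """Cosine similarity between two term vectors (0.0 if either is empty)."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

print(cosine(term_vector("tornado damage in the plains"),
             term_vector("storm damage across the plains")))
```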
- Full text
Of course, the logical extreme is to make the full
text of the returned articles available to the RIM; in theory, this
allows the deepest analysis and organization of the results,
e.g. via natural language processing techniques. Its
practicality is unclear however, first because of the time needed to
relay full text to the RIM (especially if over the web), second
because of the time needed to analyze and compare the text of many
documents. Nonetheless, this approach is being used by
some meta-search engines [Lawr99].
- Score assigned by search engine
Most engines return some sort of metric estimating how well the
document matches the query; indeed, this statistic is usually the
basis for the ordering of the returned list. While innovative
approaches seek to transcend this single statistic, it nonetheless
provides potentially useful information.
- Metadata
There will often be specific fields of data associated
with an article, such as title, length, date, dateline
(i.e. location), or author, and clearly these may contribute to the
user interface. While the length of a returned document may itself
not be of primary interest, it may be useful for normalizing query
term statistics, so as not to give undue weight to long documents.
Note that a title often carries great semantic weight, since it
supposedly identifies the main theme. Perhaps it would be worthwhile
to apply some natural language techniques to just the title?
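As a small illustration of using document length to normalize
query-term statistics, the sketch below expresses counts as occurrences
per thousand words; the particular normalization is an arbitrary choice
for the example.

```python
def normalized_term_score(raw_count, doc_length_in_words, per=1000):
    """Occurrences per `per` words, so long documents get no undue weight.

    The per-1000-words convention is an arbitrary illustrative choice;
    any monotone length normalization would serve the same purpose.
    """
    if doc_length_in_words == 0:
        return 0.0
    return raw_count * per / doc_length_in_words

print(normalized_term_score(12, 6000))  # 2.0 occurrences per 1000 words
print(normalized_term_score(3, 500))    # 6.0 -- short doc, higher density
```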
- From the document database:
Global information about the database from which the documents are
drawn can help the RIM place the individual entries in context. For
instance, if the user wants "recent" documents, the actual cutoff date
used would depend on the date distribution.
- Metadata statistics
The most obvious example of overall database statistics is
cumulative statistics for metadata. For instance, the distribution
of dates and document lengths within the collection could help
place the result set in context.
- Term frequency
Term frequency within a collection can be an extremely important tool
for introducing structure into a result set. Research has shown that
somewhat rare, but not unique, terms often serve as the best
classifiers [Wise99].
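A rough sketch of how collection-wide document frequency might be used
to pick "somewhat rare, but not unique" terms as candidate classifiers;
the frequency band used here is an assumed threshold for illustration,
not a value taken from [Wise99].

```python
from collections import Counter

def candidate_classifier_terms(doc_term_sets, low=0.30, high=0.70):
    """Return terms whose document frequency falls in a middle band.

    doc_term_sets: one set of distinct terms per document.  Terms
    appearing in nearly every document discriminate poorly, and terms
    appearing in only one document are too specific to group by; the
    middle band tends to yield the most useful cluster labels.
    """
    n = len(doc_term_sets)
    df = Counter(t for terms in doc_term_sets for t in terms)
    return sorted(t for t, k in df.items() if low <= k / n <= high)

docs = [{"storm", "iowa"}, {"storm", "kansas"}, {"storm", "iowa", "relief"},
        {"storm", "budget"}, {"storm", "texas"}, {"storm", "iowa"}]
print(candidate_classifier_terms(docs))   # ['iowa'] -- in 3 of 6 documents
```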
- From the query:
- List of query words
As mentioned above, one can reasonably infer that the terms mentioned
explicitly in the user's query are significant and therefore should
serve as part of the basis of the set structure.
- Annotation or weighting of query terms
-
The RIM might also have access to special query annotations. For
instance, many search engines interpret a plus sign in front of a term
to mean that the presence of the term is mandatory. While the
presence or absence of such a term would not be useful to distinguish
among documents in the result set, the number of its occurrences might
serve as a marker of relevance.
- From the user:
Even after the documents have been returned, the RIM can still garner
additional information from the user to help organize the results.
This can be especially valuable, because the user can take into
account the gross properties of the result set; e.g. does it appear
that there are many relevant documents or just a few? Is one
particular query term very heavily represented, or are there just a
few instantiating documents?
- Post hoc weighting of query terms
The user may be allowed to change the importance of query terms on the
fly, and the presentation of the result set can be updated
accordingly. Some early versions of NIRVE allowed this.
- Aggregation of query terms
One of the more valuable techniques emerging from the NIRVE project
was a mechanism by which users could group related terms into
so-called concepts. E.g. in the NIRVE system, users could
map the terms "tornado" and "twister" and "cyclone" to a single
concept named "STORM". Thus, users could specify many terms in the
query so as not to miss relevant documents, but then consolidate the
terms into a smaller number of concepts so as to simplify the
resulting structure.
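A minimal sketch of the term-to-concept aggregation idea; the data
structures and function names are invented for illustration and are not
the actual NIRVE implementation.

```python
# Term-to-concept mapping in the spirit of NIRVE's "concepts": the user
# broadens the query with many terms, then folds them into fewer concepts.
concept_map = {
    "tornado": "STORM", "twister": "STORM", "cyclone": "STORM",
    "iowa": "PLACE", "kansas": "PLACE",
}

def concepts_in_document(query_term_counts, concept_map):
    """Collapse per-term counts into per-concept counts for one document."""
    totals = {}
    for term, count in query_term_counts.items():
        concept = concept_map.get(term, term.upper())  # unmapped terms stand alone
        totals[concept] = totals.get(concept, 0) + count
    return totals

print(concepts_in_document({"tornado": 2, "twister": 1, "iowa": 3}, concept_map))
# {'STORM': 3, 'PLACE': 3}
```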
2.2 Structuring the Result Set
Given (some of) the information above,
what logical structure should the RIM then impose on
the result set so as to maximize user comprehension? The answer, of
course, depends on the nature of the documents, the user, and on the
tasks to be performed, but we can imagine a spectrum of plausible
approaches. The base case, a ranked list, is simply a one-dimensional
sequence, ordered by estimated relevance to the query.
The most common strategy for various research prototypes has been to
form clusters of documents. Normally, the clusters are
exclusive and exhaustive, though they needn't be. The clusters are
typically labelled in some way so as to help the user guess where the
desired information is most likely to reside. Clustering is, however,
only one of a number of possible structures. One can imagine
assigning documents to nodes in a hierarchy. Or, documents, or clusters
thereof, could be arranged in a network. This latter idea was the
approach taken by NIRVE: clusters of documents were formed, based on
the concepts they instantiated, and then these clusters were arranged
so as to exhibit the relationship among them. Note that two kinds of
relational information are involved: the relationship directly among
documents (e.g. among the documents within a cluster) and the
relationship among the document aggregates (e.g. a network or
hierarchy of clusters).
In addition to the possibilities just mentioned, some systems
also include the query terms themselves as components of the
logical structure, as well as the returned documents. For instance,
Veerasamy et al. [Veer97] construct the
many-to-many "contains" relation between documents and query terms.
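To make these structuring options concrete, the sketch below groups
documents into clusters keyed by the set of concepts each one
instantiates, and also records the many-to-many document/term
"contains" relation. It is a simplified illustration, not a
reconstruction of NIRVE or of Veerasamy's system.

```python
from collections import defaultdict

def cluster_by_concept_signature(doc_concepts):
    """Group documents by the exact set of concepts each one instantiates.

    doc_concepts maps a document id to the set of concepts it contains.
    Documents sharing a signature fall in the same cluster, and the
    signature itself serves as a ready-made cluster label.
    """
    clusters = defaultdict(list)
    for doc_id, concepts in doc_concepts.items():
        clusters[frozenset(concepts)].append(doc_id)
    return clusters

def contains_relation(doc_terms):
    """Many-to-many 'contains' relation between documents and query terms."""
    return {(d, t) for d, terms in doc_terms.items() for t in terms}

docs = {"d1": {"STORM", "PLACE"}, "d2": {"STORM"}, "d3": {"STORM", "PLACE"}}
for signature, members in cluster_by_concept_signature(docs).items():
    print(sorted(signature), members)
print(sorted(contains_relation({"d1": {"tornado", "iowa"}})))
```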
2.3 Presentation of Results
Once a logical structure has been determined, there is a second design
choice: how should the structure be visually displayed? Even in the
simple case of a list, where the visual order normally
corresponds to the logical order, there are still significant
presentation issues: how much detail to exhibit per document, and how
many document entries to present on a single page or screen?
There are many visual paradigms for representing hierarchies from the
basic (indented lists) to the elaborate (cone-trees
[Hear97], tree-maps
[Shne92]). A hierarchy is probably the structure
best-suited to a purely textual representation. Most others seem to
require some form of visualization. Veerasamy's system exhibits the
document:query-term relation as a grid, in each cell of which is
depicted the frequency of occurrence of the term in the document.
Although implemented graphically, this could now probably be done
purely within HTML, using tables. Clusters and networks have been
presented in a wide variety of imaginative visual formats - again, see
[Cugi00] for an overview and references. Note
that sophisticated web-enabled visualizations (including 3-D and
hyper-links) can be implemented using the Virtual Reality
Modeling Language (VRML). It is certainly possible,
however, to express a semantic network of documents within simple HTML
by using links [Tebb99],
[Golo97].
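As a sketch of the observation that such a grid could be rendered with
ordinary HTML tables, the following illustrative snippet emits one row
per document and one column per query term; the markup conventions are
assumptions, not a description of the original system.

```python
def term_grid_html(doc_term_counts, query_terms):
    """Emit a plain HTML table: one row per document, one column per
    query term, each cell showing the term's occurrence count."""
    rows = ["<table border='1'>",
            "<tr><th>document</th>" +
            "".join(f"<th>{t}</th>" for t in query_terms) + "</tr>"]
    for doc_id, counts in doc_term_counts.items():
        cells = "".join(f"<td>{counts.get(t, 0)}</td>" for t in query_terms)
        rows.append(f"<tr><td>{doc_id}</td>{cells}</tr>")
    rows.append("</table>")
    return "\n".join(rows)

print(term_grid_html({"d1": {"tornado": 2, "iowa": 1}, "d2": {"iowa": 4}},
                     ["tornado", "iowa"]))
```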
2.4 Manipulation of Results
The RIM can also allow users to directly manipulate the structure and
presentation of the result set. As a simple example, NIRVE allowed
users to mark individual documents and clusters of documents
as good, bad or unsure. The user could then apply a filter to the set
so as to display or suppress documents within these categories.
Another familiar case is the ability to expand or compress
sub-trees within a hierarchy.
More generally, the operations provided by the RIM should
match the tasks typically associated with search results.
Users should be able to discard irrelevant documents and
groups of documents, annotate document entries,
switch documents among labelled groups,
save selected results, and so on.
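A toy sketch of the mark-and-filter manipulation just described; the
three category labels come from the NIRVE example above, but the rest
of the interface is invented for illustration.

```python
class ResultSetView:
    """Minimal mark-and-filter operations over a result set."""

    def __init__(self, doc_ids):
        self.marks = {doc_id: "unsure" for doc_id in doc_ids}

    def mark(self, doc_id, judgment):
        assert judgment in ("good", "bad", "unsure")
        self.marks[doc_id] = judgment

    def visible(self, show=("good", "unsure")):
        """Apply a filter: display only documents in the given categories."""
        return [d for d, j in self.marks.items() if j in show]

view = ResultSetView(["d1", "d2", "d3"])
view.mark("d2", "bad")
view.mark("d1", "good")
print(view.visible())            # ['d1', 'd3'] -- 'bad' documents suppressed
print(view.visible(("bad",)))    # ['d2']
```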
3. Evaluation Issues
Once all the above design choices have been resolved, what issues
arise when we try to evaluate the resulting interface?
3.1 Base Case and Isolation of Variables
Testing the presentation of search results is made somewhat easier
because there does seem to be a widely acknowledged base case: the
ranked list. The most straightforward approach, then, is to formulate
a result set (or sets), along with some associated tasks, and measure
how well these tasks are accomplished using the traditional ranked
list as the control versus the experimental interface. Of course one
might also want to make a comparison among several innovative
prototypes.
In the case of NIRVE, we did a study [Sebr99]
comparing variations of the prototype: the documents of the result set
were arranged into clusters which were presented in three ways:
textually, as a 2-D visualization, and as a 3-D visualization. Thus,
we deliberately did not measure the effectiveness of NIRVE vs. a
ranked list, but rather performed a narrower measurement of the
relative utility of various visualization styles.
Note in particular that researchers may wish to distinguish the
benefit (if any) produced by the structuring of a result set
as opposed to the presentation mode. E.g. a 3-D
visualization of document clusters differs from a ranked list because
of the clustering as well as the 3-D. Likewise, there are many ways
to depict a hierarchy: the logical structure is the same although the
appearance may differ radically. A well-crafted test can distinguish
the effect of the hierarchy itself from that of its visual
representation.
3.2 Log Data
Evaluation of interactive systems starts with log data: some record of
the user's activity while trying to perform a task. There are three
dimensions by which log data can be categorized.
- Human/computer activity:
While log data normally makes us think of mouse-clicks, time-stamps,
and page-jumps, recall that we can also record bodily movement, gaze
direction, oral reports and so on. The point is that some significant
user activity may not involve direct interaction with the
system under test.
- Manual/automated data collection:
The raw data for a test session may be collected automatically or by a
human observer/recorder. Automatic data collection may be implemented
by software to capture machine activity or by independent recording
devices, such as microphones and video cameras.
- Manual/automated derivation of metric from log data:
A summative number may be readily calculated from the data or human
judgement may be required. E.g. a metric like "number of oral
complaints about usability" may require human interpretation of an
audio tape.
3.3 Metrics: Types and Interpretation
Devising suitable metrics for interactive systems (those in which the
essential function of the software is to enable or enhance some human
activity) is especially challenging, because of the semantic depth and
variability of human performance. Even an apparently good set of
metrics may not capture everything of interest. For instance, if no
metric is sensitive to "excessive" mouse motion recorded in the log
data, potentially useful information may be lost.
3.3.1 Metric Content: Performance and Satisfaction
While most of our discussion concerns the measurement of functional
performance, such as speed and accuracy of results, we should remember
that subjective user satisfaction is also an important property of
interactive systems, albeit not one for which automatic measurement is
easy. There are widely-accepted instruments for assessing user
satisfaction, such as the Questionnaire for User Interaction
Satisfaction [QUIS] from the University of
Maryland.
3.3.2 Metric Level
One of the most important dimensions when trying to assess the
effectiveness of an interactive system is the level of detail of the
metric. High-level metrics are those which try to capture broad
properties of a system: its overall functional performance or
subjective user satisfaction. They tend to be more portable across
various implementations targeting the same task. Performance metrics
for presenting search results might include:
- Percentage of relevant documents found within a given time, when the task is to find as many as possible
- Relative error of response, when the task is to estimate quickly how many documents in the result set are relevant
- Relevance score of a selected document, when the task is to find the "best" one
- Time taken to find one relevant document
- Time taken to answer a specific question
Low-level metrics are those which capture the details of user
interaction with a system. They are more directly based on the log
data and therefore perhaps easier to automate, but they may be more
difficult to interpret. Also, they are more dependent on the specific
properties of the system under test. Some search result examples:
- Depth of tree search
- Total number of network nodes or clusters traversed
- Time taken for each network node traversed
- Path length of search
The following table summarizes the distinctions between
high-level and low-level metrics.
High-level | Low-level
-----------|-----------
Measures some broad property of the user session | Measures details of the user session
Result-oriented (what got done, how fast?); summarizes overall performance, but does not explain why | Method-oriented (how did it get done?); more useful for in-depth diagnosis
Treats the implementation as a black box, hence less dependence on the specifics of the prototype | May make sense only for a specific implementation, hence less valuable for comparing systems
Easier to interpret | More need for integrative analysis
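To make the distinction concrete, here is a small hypothetical example:
both metrics below are derived from the same automatically collected
event log, but the first summarizes what got done while the second
records how the user got there. The log format, event names, and
relevance judgments are assumptions of the sketch.

```python
# Hypothetical event log: (seconds_since_start, event, detail)
log = [(2.0, "visit_cluster", "STORM+PLACE"),
       (9.5, "open_document", "d7"),
       (14.0, "mark_relevant", "d7"),
       (21.0, "visit_cluster", "STORM"),
       (30.0, "open_document", "d2"),
       (55.0, "mark_relevant", "d2")]

known_relevant = {"d2", "d7", "d9"}          # definitive human judgments

# High-level metric: percentage of relevant documents found within 60 seconds.
found = {detail for t, event, detail in log
         if event == "mark_relevant" and t <= 60.0}
print(100.0 * len(found & known_relevant) / len(known_relevant))
# 66.66... -- 2 of the 3 relevant documents found

# Low-level metric: number of cluster nodes traversed during the session.
print(sum(1 for _, event, _ in log if event == "visit_cluster"))   # 2
```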
3.3.3 Interpretation of Metrics
A well-designed test will result in a value assigned to each of a
number of metrics and these results presumably capture some important
information about the object under test. There is always an issue of
the interpretation of a metric (does "average time for task
completion" measure "ease of use"?) and the selection of a good set of
metrics is by no means a trivial chore. Moreover, the "goodness" of a
metric is relative to the intended task and usage. For example, one
way of presenting search results may work well if the intended task is
browsing, but not for directed searching - and therefore different
metrics may be called for. When there is sufficient commonality among
a class of related applications, certain metrics may gain widespread
acceptance, e.g. the metrics defined by the TREC [TREC]
process to measure the effectiveness of search engines.
3.4 Generic Test Support
Part of NIST's stock in trade is the provision of common test
frameworks by which emerging technologies can be evaluated. Testing
the presentation of search results would seem to fit this paradigm.
As mentioned earlier, the interactive track of TREC [INTREC] has done related work, although it has
not (yet) isolated the sub-task of user manipulation of a given result
set from the broader task of working with a search engine.
A common test framework implies some agreement on the nature of the
relevant tasks for this class of application. Presumably, such tasks
would include identifying relevant documents, answering specific
questions (who was the Governor of Iowa in 1988?), gathering
information about a topic area (find all examples of tornados in
winter), and also about the result set itself (how many relevant
documents are there?). Nonetheless, formulating a good set of metrics
remains a challenging design problem.
It would seem wise for a common test set to concentrate on
certain kinds of log data and metrics. Such a set should maximize the
automation of data collection and analysis. Also, since the
set should apply to (and allow comparability among) a wide
variety of paradigms, it must probably be limited to high-level
metrics. The components of such a test set would include:
- Several result sets, with the documents given in some standard format. This would include full text and metadata.
- Associated document data, such as full term vectors, and estimated relevance scores (as if from a search engine).
- A set of topics and the associated queries that supposedly generated the result set.
- For each topic and result set, some definitive human judgment about the relevance of each document (i.e. its "real" relevance score, not just that estimated by the search engine).
- Perhaps some characterization of the virtual database from which the result set was drawn, such as term frequency and length distribution.
- A set of tasks (of various types) to be posed to the human subjects in the test sessions. There should also be a known "right answer" against which actual responses can be compared, automatically if possible (a small scoring sketch follows this list).
- Support software for data logging and analysis.
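As a sketch of how the "known right answer" component could support
automatic scoring, the fragment below scores a document-identification
task against definitive relevance judgments; the data layout is
invented for illustration and is not a proposed standard.

```python
def score_identification_task(selected_doc_ids, relevance_judgments):
    """Score a 'find the relevant documents' task against definitive judgments.

    relevance_judgments: dict mapping doc id -> True/False (human judgment).
    Returns (recall, precision), computed automatically from the subject's
    selections, with no human interpretation required.
    """
    relevant = {d for d, rel in relevance_judgments.items() if rel}
    selected = set(selected_doc_ids)
    recall = len(selected & relevant) / len(relevant) if relevant else 0.0
    precision = len(selected & relevant) / len(selected) if selected else 0.0
    return recall, precision

judgments = {"d1": True, "d2": False, "d3": True, "d4": True}
print(score_identification_task(["d1", "d2", "d3"], judgments))
# approximately (0.67, 0.67)
```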
4. Conclusion
Although there has been a good deal of research, the ranked list
paradigm still dominates the searching and browsing activity of most
users. Only a very few innovative prototypes have been carefully
measured against and found superior to this common interface. It
seems reasonable to suppose that a standard and widely available test
methodology will spur research in this important area.
Acknowledgments
I thank my NIST colleagues for their help and advice: Sharon
Laskowski, Donna Harman, Emile Morse, John Tebbutt, Carolyn Schmidt,
and Paul Hsiao.
References
[Cugi00]
J. Cugini, S. Laskowski, M. Sebrechts,
"Design of 3D Visualization of Search Results: Evolution and Evaluation",
Proceedings of IST/SPIE's 12th Annual International Symposium:
Electronic Imaging 2000: Visual Data Exploration and Analysis
(SPIE 2000),
San Jose, CA, 23-28 January 2000. See
http://www.itl.nist.gov/iaui/vvrg/cugini/uicd/nirve-home.html
[Golo97]
G. Golovchinsky, "Queries? Links? Is there a difference?",
Proceedings of CHI'97, pp. 407-414,
Atlanta, GA, 22-27 March 1997.
See
http://www.acm.org/sigchi/chi97/proceedings/paper/gxg.htm
[Hear97]
M. Hearst and C. Karadi,
"Cat-a-Cone: An Interactive Interface for Specifying Searches and
Viewing Retrieval Results using a Large Category Hierarchy",
Proceedings 20th Annual International ACM SIGIR Conference,
Philadelphia, PA, July 27-31, 1997.
[IAD]
Details of the Division's program may be seen at:
http://www.itl.nist.gov/iaui/
[INTREC]
For recent work, see:
http://trec.nist.gov/pubs/trec7/papers/index.track.html#interactive
http://trec.nist.gov/pubs/trec8/papers/index.track.html#interactive
http://www-nlpir.nist.gov/projects/t7i/t7i.html
http://www-nlpir.nist.gov/projects/t8i/t8i.html
[IUSR]
I-USR Project:
http://www.nist.gov/iusr/
[Lawr99]
S. Lawrence and L. Giles,
"Accessibility of information on the web", Nature,
Vol. 400, pp. 107-109, 1999.
See http://wwwmetrics.com/.
[NIRVE]
NIRVE Home Page:
http://www.itl.nist.gov/iaui/vvrg/cugini/uicd/nirve-home.html
[QUIS]
QUIS:
http://www.cs.umd.edu/hcil/quis/
[Sebr99]
M. Sebrechts, J. Vasilakis, M. Miller, J. Cugini, S. Laskowski,
"Visualization of Search Results: A Comparative Evaluation of Text,
2D, and 3D Interfaces",
22nd International ACM SIGIR Conference on Research and
Development in Information Retrieval,
Berkeley, California, August 1999. See
http://www.itl.nist.gov/iaui/vvrg/cugini/uicd/nirve-home.html
[Shne92]
Ben Shneiderman,
"Tree visualization with tree-maps: 2-d space-filling approach",
ACM Transactions on Graphics,
11(1) (1992), pp. 92-99.
[Tebb99]
John Tebbutt, "User Evaluation of Automatically Generated Semantic
Hypertext Links in a Heavily Used Procedural Manual",
Information Processing and Management, 35(1) (1999), pp. 1-18.
See:
http://www.itl.nist.gov/iaui/894.02/works/papers/user_eval.html
[TREC]
Text Retrieval Conference:
http://trec.nist.gov
[Veer97]
Aravindan Veerasamy and Russell Heikes, "Effectiveness of a
graphical display of retrieval results",
Proceedings 20th Annual International ACM SIGIR Conference,
Philadelphia, PA, July 27-31, 1997.
[WEBM]
WebMetrics:
http://www.nist.gov/webmet/
[Wise99]
J. A. Wise, "The Ecological Approach to Text Visualization",
Journal of the American Society for Information Science,
50(13), November 1999.