Presenting Search Results:
Design, Visualization, and Evaluation
John Cugini / cuz@nist.gov
Information Technology Laboratory
National Institute of Standards and Technology
(NIST)
Gaithersburg, MD 20899
Contribution of the National Institute of Standards and Technology.
Not subject to copyright. Reference to specific commercial products
or brands is for information purposes only; no endorsement or
recommendation by the National Institute of Standards and Technology,
explicit or implicit, is intended.
Submitted to the
Information Doors -- Where Information Search and Hypertext Link
workshop, May 30, 2000, San Antonio, Texas (held in conjunction with
the ACM Hypertext and Digital Libraries conferences).
This paper is accessible at:
http://www.itl.nist.gov/iaui/vvrg/cugini/irlib/paper-may2000.html
Abstract
A number of projects at the
National Institute of
Standards and Technology (NIST) have addressed the generation,
evaluation, and presentation of search results. This paper contains a
general characterization of the presentation problem, and an outline
of the components of an associated evaluation system. Within the
design space for this problem, we distinguish between the logical
structure imposed on the result set and the interface by which the
structured results are presented to the user. This interface includes
the operations provided for the manipulation of the set as well as its
visual presentation. Any design, no matter how intuitively appealing,
should be evaluated and the full array of issues for HCI testing then
comes into play. In particular, researchers must decide on a base
case against which to measure results, whether to use high-level
and/or low-level metrics, and which tasks are appropriate for the
evaluation.
Keywords
Search engine, search results, user interface, information retrieval,
visualization, usability evaluation.
1. Relevant NIST Experience and Projects
The major mission of the Information Access
Division [IAD] of NIST is to accelerate the
development of technologies that allow intuitive and efficient access,
manipulation, and exchange of complex information. We contribute to
those goals primarily via measurement methods and standards. Several
of our projects, described below, are particularly relevant to the
problem of designing an interface by which users can examine and
manipulate the set of documents resulting from an automatic search.
In this paper, we present a general characterization of this problem
and we propose a method for systematic evaluation of such interfaces.
One of the lessons of our experience is that no matter how much
intuitive appeal a given interface might have, without some systematic
testing, its real value remains unknown. Especially in the field of
visualization, it is all too common for technical wizardry to be
unaccompanied by any real gain in efficiency. Several IAD programs
have demonstrated the value of applying a consistent set of tests to
emerging technologies. Evaluation is a crucial link in the feedback
loop. It focuses the development effort and gives some sense of where
progress is being made.
- The annual Text REtrieval Conference [TREC] has become the premier forum for the quantitative evaluation of search
engines. Note that this is primarily a comparison of the quality of
result sets returned by various engines for a given query, not of the
mode of presentation. However, for the past few years the
interactive track [INTREC] of TREC has
experimented with methods for evaluating the user interface.
- For the past four years, the NIST Information Retrieval
Visualization Engine [NIRVE] project has explored
the potential value of 3-D visualization in helping users understand
and manipulate search results. One paper emerging from this project
describes the design of NIRVE and also presents a brief survey of
other attempts to visualize result sets [Cugi00].
Another describes an in-depth user evaluation performed on variants of
NIRVE [Sebr99].
- Two projects directly address usability testing. The I-USR [IUSR] project is developing standards for reporting
the results of user testing, so that usability data can be shared
between software developers and purchasing organizations. The WebMetrics
[WEBM] project researches and develops software
prototypes, notably including tools for logging user activity and
visualizing the resulting log files, in order to support rapid,
remote, and automated usability testing of web-based applications.
- In a consulting project for the Social Security Administration, NIST
researchers developed a system for the automatic generation of
semantic hyperlinks to allow easier navigation within a large body of
text. Furthermore, this was followed up by a user study to help
evaluate and refine the prototype [Tebb99].
- Finally, we have recently undertaken a project called IRLIB to address
a problem that has long plagued scientific communities: limited access
to older citations. The main goals of IRLIB are to determine how best
to make this material accessible and to investigate methods for remote
testing of the usability of digital libraries. During late 1998 and
early 1999 IAD built a very small-scale digital library of some older
citations of interest to the IR community, consisting of six
proceedings, one book, and one NIST monograph. These were scanned,
converted to text with OCR, and made web-accessible for searching.
2. Design Framework for Search Results
Can we do better than ranked lists? The intuition that the answer
must be "yes" has motivated several research projects in
visualization, as mentioned above [Cugi00].
We set aside the problem of producing a better result set and take the
actual result set as a given. How then should the set be organized and
presented so as to allow users to accomplish their goals?
2.1 Information Resources
Let us start by listing the information that may be available as input
to some automated process, which we will refer to as the result
interface manager (RIM).
- From each resulting document:
- Query term occurrence
At the least it would be useful for the
RIM to know which subset of query terms occurs in each document. It
would be better still if the search engine returned a count of the
number of occurrences of each query term. Since these are the terms
the user thought important enough to put in the query in the first
place, it seems reasonable to assume that they should help to
determine the way in which the result set is structured. Some search
engines, however, do not return this information.
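For concreteness, here is a minimal sketch (not drawn from any of the
systems discussed here) of how a RIM might tally query-term occurrences
when it has only the document text and the list of query terms; the
simple tokenization is an assumption of the example.

```python
import re
from collections import Counter

def query_term_counts(document_text, query_terms):
    """Count how often each query term occurs in one document.

    Returns a dict mapping each query term to its occurrence count;
    terms with a count of zero are included so the RIM can also see
    which subset of the query is missing from the document.
    """
    tokens = re.findall(r"[a-z0-9]+", document_text.lower())
    counts = Counter(tokens)
    return {term: counts.get(term.lower(), 0) for term in query_terms}

# Example: which query terms appear in this document, and how often?
doc = "A tornado, also called a twister, struck the town twice."
print(query_term_counts(doc, ["tornado", "twister", "flood"]))
# {'tornado': 1, 'twister': 1, 'flood': 0}
```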
- Full term vector
For a managed digital library, the complete
term vector (i.e. including all significant terms, not just those
matching the query) of a document gives a more complete
characterization and may be pre-computed. For an unmanaged
collection, such as the entire web, acquiring this information is more
difficult.
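When the collection is managed, the full term vector could be
pre-computed and stored with each document; the sketch below shows one
plausible representation and a cosine comparison between two such
vectors. The stop-word list and the raw-count weighting are
illustrative assumptions only.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "the", "of", "and", "in", "to", "is"}  # illustrative only

def term_vector(text):
    """Pre-computable term vector: counts of all significant terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def cosine(v1, v2):
    """Cosine similarity between two term vectors (0.0 if either is empty)."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

print(cosine(term_vector("tornado damage in the plains"),
             term_vector("storm damage across the plains")))
```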
- Full text
Of course, the logical extreme is to make the full
text of the returned articles available to the RIM; in theory, this
allows the deepest analysis and organization of the results,
e.g. via natural language processing techniques. Its
practicality is unclear however, first because of the time needed to
relay full text to the RIM (especially if over the web), second
because of the time needed to analyze and compare the text of many
documents. Nonetheless, this approach is being used by
some meta-search engines [Lawr99].
- Score assigned by search engine
Most engines return some sort of metric estimating how well the
document matches the query; indeed, this statistic is usually the
basis for the ordering of the returned list. While innovative
approaches seek to transcend this single statistic, it nonetheless
provides potentially useful information.
- Metadata
There will often be specific fields of data associated
with an article, such as title, length, date, dateline
(i.e. location), or author, and clearly these may contribute to the
user interface. While the length of a returned document may itself
not be of primary interest, it may be useful for normalizing query
term statistics, so as not to give undue weight to long documents.
Note that a title often carries great semantic weight, since it
supposedly identifies the main theme. Perhaps it would be worthwhile
to apply some natural language techniques to just the title?
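As a small illustration of using document length to normalize
query-term statistics, the sketch below expresses counts as occurrences
per thousand words; the particular normalization is an arbitrary choice
for the example.

```python
def normalized_term_score(raw_count, doc_length_in_words, per=1000):
    """Occurrences per `per` words, so long documents get no undue weight.

    The per-1000-words convention is an arbitrary illustrative choice;
    any monotone length normalization would serve the same purpose.
    """
    if doc_length_in_words == 0:
        return 0.0
    return raw_count * per / doc_length_in_words

print(normalized_term_score(12, 6000))  # 2.0 occurrences per 1000 words
print(normalized_term_score(3, 500))    # 6.0 -- short doc, higher density
```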
- From the document database:
Global information about the database from which the documents are
drawn can help the RIM place the individual entries in context. For
instance, if the user wants "recent" documents, the actual cutoff date
used would depend on the date distribution.
- Metadata statistics
The most obvious example of overall database statistics is
cumulative statistics for metadata. For instance, the distribution
of dates and document lengths within the collection could help
place the result set in context.
- Term frequency
Term frequency within a collection can be an extremely important tool
for introducing structure into a result set. Research has shown that
somewhat rare, but not unique, terms often serve as the best
classifiers [Wise99].
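A rough sketch of how collection-wide document frequency might be used
to pick "somewhat rare, but not unique" terms as candidate classifiers;
the frequency band used here is an assumed threshold for illustration,
not a value taken from [Wise99].

```python
from collections import Counter

def candidate_classifier_terms(doc_term_sets, low=0.30, high=0.70):
    """Return terms whose document frequency falls in a middle band.

    doc_term_sets: one set of distinct terms per document.  Terms
    appearing in nearly every document discriminate poorly, and terms
    appearing in only one document are too specific to group by; the
    middle band tends to yield the most useful cluster labels.
    """
    n = len(doc_term_sets)
    df = Counter(t for terms in doc_term_sets for t in terms)
    return sorted(t for t, k in df.items() if low <= k / n <= high)

docs = [{"storm", "iowa"}, {"storm", "kansas"}, {"storm", "iowa", "relief"},
        {"storm", "budget"}, {"storm", "texas"}, {"storm", "iowa"}]
print(candidate_classifier_terms(docs))   # ['iowa'] -- in 3 of 6 documents
```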
- From the query:
- List of query words
As mentioned above, one can reasonably infer that the terms mentioned
explicitly in the user's query are significant and therefore should
serve as part of the basis of the set structure.
- Annotation or weighting of query terms
-
The RIM might also have access to special query annotations. For
instance, many search engines interpret a plus sign in front of a term
to mean that the presence of the term is mandatory. While the
presence or absence of such a term would not be useful to distinguish
among documents in the result set, the number of its occurrences might
serve as a marker of relevance.
- From the user:
Even after the documents have been returned, the RIM can still garner
additional information from the user to help organize the results.
This can be especially valuable, because the user can take into
account the gross properties of the result set; e.g. does it appear
that there are many relevant documents or just a few? Is one
particular query term very heavily represented, or are there just a
few instantiating documents?
- Post hoc weighting of query terms
The user may be allowed to change the importance of query terms on the
fly, and the presentation of the result set can be updated
accordingly. Some early versions of NIRVE allowed this.
- Aggregation of query terms
One of the more valuable techniques emerging from the NIRVE project
was a mechanism by which users could group related terms into
so-called concepts. E.g. in the NIRVE system, users could
map the terms "tornado" and "twister" and "cyclone" to a single
concept named "STORM". Thus, users could specify many terms in the
query so as not to miss relevant documents, but then consolidate the
terms into a smaller number of concepts so as to simplify the
resulting structure.
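A minimal sketch of the term-to-concept aggregation idea; the data
structures and function names are invented for illustration and are not
the actual NIRVE implementation.

```python
# Term-to-concept mapping in the spirit of NIRVE's "concepts": the user
# broadens the query with many terms, then folds them into fewer concepts.
concept_map = {
    "tornado": "STORM", "twister": "STORM", "cyclone": "STORM",
    "iowa": "PLACE", "kansas": "PLACE",
}

def concepts_in_document(query_term_counts, concept_map):
    """Collapse per-term counts into per-concept counts for one document."""
    totals = {}
    for term, count in query_term_counts.items():
        concept = concept_map.get(term, term.upper())  # unmapped terms stand alone
        totals[concept] = totals.get(concept, 0) + count
    return totals

print(concepts_in_document({"tornado": 2, "twister": 1, "iowa": 3}, concept_map))
# {'STORM': 3, 'PLACE': 3}
```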
2.2 Structuring the Result Set
Given (some of) the information above,
what logical structure should the RIM then impose on
the result set so as to maximize user comprehension? The answer, of
course, depends on the nature of the documents, the user, and on the
tasks to be performed, but we can imagine a spectrum of plausible
approaches. The base case, a ranked list, is simply a one-dimensional
sequence, ordered by estimated relevance to the query.
The most common strategy for various research prototypes has been to
form clusters of documents. Normally, the clusters are
exclusive and exhaustive, though they needn't be. The clusters are
typically labelled in some way so as to help the user guess where the
desired information is most likely to reside. Clustering is, however,
only one of a number of possible structures. One can imagine
assigning documents to nodes in a hierarchy. Or, documents, or clusters
thereof, could be arranged in a network. This latter idea was the
approach taken by NIRVE: clusters of documents were formed, based on
the concepts they instantiated, and then these clusters were arranged
so as to exhibit the relationship among them. Note that two kinds of
relational information are involved: the relationship directly among
documents (e.g. among the documents within a cluster) and the
relationship among the document aggregates (e.g. a network or
hierarchy of clusters).
In addition to the possibilities just mentioned, some systems
also include the query terms themselves as components of the
logical structure, as well as the returned documents. For instance,
Veerasamy et al. [Veer97] construct the
many-to-many "contains" relation between documents and query terms.
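To make these structuring options concrete, the sketch below groups
documents into clusters keyed by the set of concepts each one
instantiates, and also records the many-to-many document/term
"contains" relation. It is a simplified illustration, not a
reconstruction of NIRVE or of Veerasamy's system.

```python
from collections import defaultdict

def cluster_by_concept_signature(doc_concepts):
    """Group documents by the exact set of concepts each one instantiates.

    doc_concepts maps a document id to the set of concepts it contains.
    Documents sharing a signature fall in the same cluster, and the
    signature itself serves as a ready-made cluster label.
    """
    clusters = defaultdict(list)
    for doc_id, concepts in doc_concepts.items():
        clusters[frozenset(concepts)].append(doc_id)
    return clusters

def contains_relation(doc_terms):
    """Many-to-many 'contains' relation between documents and query terms."""
    return {(d, t) for d, terms in doc_terms.items() for t in terms}

docs = {"d1": {"STORM", "PLACE"}, "d2": {"STORM"}, "d3": {"STORM", "PLACE"}}
for signature, members in cluster_by_concept_signature(docs).items():
    print(sorted(signature), members)
print(sorted(contains_relation({"d1": {"tornado", "iowa"}})))
```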
2.3 Presentation of Results
Once a logical structure has been determined, there is a second design
choice: how should the structure be visually displayed? Even in the
simple case of a list, where the visual order normally
corresponds to the logical order, there are still significant
presentation issues: how much detail to exhibit per document, and how
many document entries to present on a single page or screen?
There are many visual paradigms for representing hierarchies from the
basic (indented lists) to the elaborate (cone-trees
[Hear97], tree-maps
[Shne92]). A hierarchy is probably the structure
best-suited to a purely textual representation. Most others seem to
require some form of visualization. Veerasamy's system exhibits the
document:query-term relation as a grid, in each cell of which is
depicted the frequency of occurrence of the term in the document.
Although implemented graphically, this could now probably be done
purely within HTML, using tables. Clusters and networks have been
presented in a wide variety of imaginative visual formats - again, see
[Cugi00] for an overview and references. Note
that sophisticated web-enabled visualizations (including 3-D and
hyper-links) can be implemented using the Virtual Reality
Modeling Language (VRML). It is certainly possible,
however, to express a semantic network of documents within simple HTML
by using links [Tebb99],
[Golo97].
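As a sketch of the observation that such a grid could be rendered with
ordinary HTML tables, the following illustrative snippet emits one row
per document and one column per query term; the markup conventions are
assumptions, not a description of the original system.

```python
def term_grid_html(doc_term_counts, query_terms):
    """Emit a plain HTML table: one row per document, one column per
    query term, each cell showing the term's occurrence count."""
    rows = ["<table border='1'>",
            "<tr><th>document</th>" +
            "".join(f"<th>{t}</th>" for t in query_terms) + "</tr>"]
    for doc_id, counts in doc_term_counts.items():
        cells = "".join(f"<td>{counts.get(t, 0)}</td>" for t in query_terms)
        rows.append(f"<tr><td>{doc_id}</td>{cells}</tr>")
    rows.append("</table>")
    return "\n".join(rows)

print(term_grid_html({"d1": {"tornado": 2, "iowa": 1}, "d2": {"iowa": 4}},
                     ["tornado", "iowa"]))
```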
2.4 Manipulation of Results
The RIM can also allow users to directly manipulate the structure and
presentation of the result set. As a simple example, NIRVE allowed
users to mark individual documents and clusters of documents
as good, bad or unsure. The user could then apply a filter to the set
so as to display or suppress documents within these categories.
Another familiar case is the ability to expand or compress
sub-trees within a hierarchy.
More generally, the operations provided by the RIM should
match the tasks typically associated with search results.
Users should be able to discard irrelevant documents and
groups of documents, annotate document entries,
switch documents among labelled groups,
save selected results, and so on.
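A toy sketch of the mark-and-filter manipulation just described; the
three category labels come from the NIRVE example above, but the rest
of the interface is invented for illustration.

```python
class ResultSetView:
    """Minimal mark-and-filter operations over a result set."""

    def __init__(self, doc_ids):
        self.marks = {doc_id: "unsure" for doc_id in doc_ids}

    def mark(self, doc_id, judgment):
        assert judgment in ("good", "bad", "unsure")
        self.marks[doc_id] = judgment

    def visible(self, show=("good", "unsure")):
        """Apply a filter: display only documents in the given categories."""
        return [d for d, j in self.marks.items() if j in show]

view = ResultSetView(["d1", "d2", "d3"])
view.mark("d2", "bad")
view.mark("d1", "good")
print(view.visible())            # ['d1', 'd3'] -- 'bad' documents suppressed
print(view.visible(("bad",)))    # ['d2']
```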
3. Evaluation Issues
Once all the above design choices have been resolved, what issues
arise when we try to evaluate the resulting interface?
3.1 Base Case and Isolation of Variables
Testing the presentation of search results is made somewhat easier
because there does seem to be a widely acknowledged base case: the
ranked list. The most straightforward approach, then, is to formulate
a result set (or sets), along with some associated tasks, and measure
how well these tasks are accomplished using the traditional ranked
list as the control versus the experimental interface. Of course one
might also want to make a comparison among several innovative
prototypes.
In the case of NIRVE, we did a study [Sebr99]
comparing variations of the prototype: the documents of the result set
were arranged into clusters which were presented in three ways:
textually, as a 2-D visualization, and as a 3-D visualization. Thus,
we deliberately did not measure the effectiveness of NIRVE vs. a
ranked list, but rather performed a narrower measurement of the
relative utility of various visualization styles.
Note in particular that researchers may wish to distinguish the
benefit (if any) produced by the structuring of a result set
as opposed to the presentation mode. E.g. a 3-D
visualization of document clusters differs from a ranked list because
of the clustering as well as the 3-D. Likewise, there are many ways
to depict a hierarchy: the logical structure is the same although the
appearance may differ radically. A well-crafted test can distinguish
the effect of the hierarchy itself from that of its visual
representation.
3.2 Log Data
Evaluation of interactive systems starts with log data: some record of
the user's activity while trying to perform a task. There are three
dimensions by which log data can be categorized.
- Human/computer activity:
While log data normally makes us think of mouse-clicks, time-stamps,
and page-jumps, recall that we can also record bodily movement, gaze
direction, oral reports and so on. The point is that some significant
user activity may not involve direct interaction with the
system under test.
- Manual/automated data collection:
The raw data for a test session may be collected automatically or by a
human observer/recorder. Automatic data collection may be implemented
by software to capture machine activity or by independent recording
devices, such as microphones and video cameras.
- Manual/automated derivation of metric from log data:
A summative number may be readily calculated from the data or human
judgement may be required. E.g. a metric like "number of oral
complaints about usability" may require human interpretation of an
audio tape.
3.3 Metrics: Types and Interpretation
Devising suitable metrics for interactive systems (those in which the
essential function of the software is to enable or enhance some human
activity) is especially challenging, because of the semantic depth and
variability of human performance. Even an apparently good set of
metrics may not capture everything of interest. For instance, if no
metric is sensitive to "excessive" mouse motion recorded in the log
data, potentially useful information may be lost.
3.3.1 Metric Content: Performance and Satisfaction
While most of our discussion concerns the measurement of functional
performance, such as speed and accuracy of results, we should remember
that subjective user satisfaction is also an important property of
interactive systems, albeit not one for which automatic measurement is
easy. There are widely-accepted instruments for assessing user
satisfaction, such as the Questionnaire for User Interaction
Satisfaction [QUIS] from the University of
Maryland.
3.3.2 Metric Level
One of the most important dimensions when trying to assess the
effectiveness of an interactive system is the level of detail of the
metric. High-level metrics are those which try to capture broad
properties of a system: its overall functional performance or
subjective user satisfaction. They tend to be more portable across
various implementations targeting the same task. Performance metrics
for presenting search results might include:
- Percentage of relevant documents found within a given time, when the task is to find as many as possible
- Relative error of response, when the task is to estimate quickly how many documents in the result set are relevant
- Relevance score of a selected document, when the task is to find the "best" one
- Time taken to find one relevant document
- Time taken to answer a specific question
Low-level metrics are those which capture the details of user
interaction with a system. They are more directly based on the log
data and therefore perhaps easier to automate, but they may be more
difficult to interpret. Also, they are more dependent on the specific
properties of the system under test. Some search result examples:
- Depth of tree search
- Total number of network nodes or clusters traversed
- Time taken for each network node traversed
- Path length of search
The following table summarizes the distinctions between
high-level and low-level metrics.
High-level | Low-level
-----------|-----------
Measures some broad property of the user session | Measures details of the user session
Result-oriented (what got done, how fast?); summarizes overall performance, but does not explain why | Method-oriented (how did it get done?); more useful for in-depth diagnosis
Treats the implementation as a black box, hence less dependence on the specifics of the prototype | May make sense only for a specific implementation, hence less valuable for comparing systems
Easier to interpret | More need for integrative analysis
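To make the distinction concrete, here is a small hypothetical example:
both metrics below are derived from the same automatically collected
event log, but the first summarizes what got done while the second
records how the user got there. The log format, event names, and
relevance judgments are assumptions of the sketch.

```python
# Hypothetical event log: (seconds_since_start, event, detail)
log = [(2.0, "visit_cluster", "STORM+PLACE"),
       (9.5, "open_document", "d7"),
       (14.0, "mark_relevant", "d7"),
       (21.0, "visit_cluster", "STORM"),
       (30.0, "open_document", "d2"),
       (55.0, "mark_relevant", "d2")]

known_relevant = {"d2", "d7", "d9"}          # definitive human judgments

# High-level metric: percentage of relevant documents found within 60 seconds.
found = {detail for t, event, detail in log
         if event == "mark_relevant" and t <= 60.0}
print(100.0 * len(found & known_relevant) / len(known_relevant))
# 66.66... -- 2 of the 3 relevant documents found

# Low-level metric: number of cluster nodes traversed during the session.
print(sum(1 for _, event, _ in log if event == "visit_cluster"))   # 2
```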
3.3.3 Interpretation of Metrics
A well-designed test will result in a value assigned to each of a
number of metrics and these results presumably capture some important
information about the object under test. There is always an issue of
the interpretation of a metric (does "average time for task
completion" measure "ease of use"?) and the selection of a good set of
metrics is by no means a trivial chore. Moreover, the "goodness" of a
metric is relative to the intended task and usage. For example, one
way of presenting search results may work well if the intended task is
browsing, but not for directed searching - and therefore different
metrics may be called for. When there is sufficient commonality among
a class of related applications, certain metrics may gain widespread
acceptance, e.g. the metrics defined by the TREC [TREC]
process to measure the effectiveness of search engines.
3.4 Generic Test Support
Part of NIST's stock in trade is the provision of common test
frameworks by which emerging technologies can be evaluated. Testing
the presentation of search results would seem to fit this paradigm.
As mentioned earlier, the interactive track of TREC [INTREC] has done related work, although it has
not (yet) isolated the sub-task of user manipulation of a given result
set from the broader task of working with a search engine.
A common test framework implies some agreement on the nature of the
relevant tasks for this class of application. Presumably, such tasks
would include identifying relevant documents, answering specific
questions (who was the Governor of Iowa in 1988?), gathering
information about a topic area (find all examples of tornados in
winter), and also about the result set itself (how many relevant
documents are there?). Nonetheless, formulating a good set of metrics
remains a challenging design problem.
It would seem wise for a common test set to concentrate on
certain kinds of log data and metrics. Such a set should maximize the
automation of data collection and analysis. Also, since the
set should apply to (and allow comparability among) a wide
variety of paradigms, it must probably be limited to high-level
metrics. The components of such a test set would include:
- Several result sets, with the documents given in some standard format. This would include full text and metadata.
- Associated document data, such as full term vectors, and estimated relevance scores (as if from a search engine).
- A set of topics and the associated queries that supposedly generated the result set.
- For each topic and result set, some definitive human judgment about the relevance of each document (i.e. its "real" relevance score, not just that estimated by the search engine).
- Perhaps some characterization of the virtual database from which the result set was drawn, such as term frequency and length distribution.
- A set of tasks (of various types) to be posed to the human subjects in the test sessions. There should also be a known "right answer" against which actual responses can be compared, automatically if possible (a small scoring sketch follows this list).
- Support software for data logging and analysis.
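As a sketch of how the "known right answer" component could support
automatic scoring, the fragment below scores a document-identification
task against definitive relevance judgments; the data layout is
invented for illustration and is not a proposed standard.

```python
def score_identification_task(selected_doc_ids, relevance_judgments):
    """Score a 'find the relevant documents' task against definitive judgments.

    relevance_judgments: dict mapping doc id -> True/False (human judgment).
    Returns (recall, precision), computed automatically from the subject's
    selections, with no human interpretation required.
    """
    relevant = {d for d, rel in relevance_judgments.items() if rel}
    selected = set(selected_doc_ids)
    recall = len(selected & relevant) / len(relevant) if relevant else 0.0
    precision = len(selected & relevant) / len(selected) if selected else 0.0
    return recall, precision

judgments = {"d1": True, "d2": False, "d3": True, "d4": True}
print(score_identification_task(["d1", "d2", "d3"], judgments))
# approximately (0.67, 0.67)
```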
4. Conclusion
Although there has been a good deal of research, the ranked list
paradigm still dominates the searching and browsing activity of most
users. Only a very few innovative prototypes have been carefully
measured against and found superior to this common interface. It
seems reasonable to suppose that a standard and widely available test
methodology will spur research in this important area.
Acknowledgments
I thank my NIST colleagues for their help and advice: Sharon
Laskowski, Donna Harman, Emile Morse, John Tebbutt, Carolyn Schmidt,
and Paul Hsiao.
References
[Cugi00]
J. Cugini, S. Laskowski, M. Sebrechts,
"Design of 3D Visualization of Search Results: Evolution and Evaluation",
Proceedings of IST/SPIE's 12th Annual International Symposium:
Electronic Imaging 2000: Visual Data Exploration and Analysis
(SPIE 2000),
San Jose, CA, 23-28 January 2000. See
http://www.itl.nist.gov/iaui/vvrg/cugini/uicd/nirve-home.html
[Golo97]
G. Golovchinsky, "Queries? Links? Is there a difference?",
Proceedings of CHI'97, pp. 407-414,
Atlanta, GA, 22-27 March 1997.
See
http://www.acm.org/sigchi/chi97/proceedings/paper/gxg.htm
[Hear97]
M. Hearst and C. Karadi,
"Cat-a-Cone: An Interactive Interface for Specifying Searches and
Viewing Retrieval Results using a Large Category Hierarchy",
Proceedings 20th Annual International ACM SIGIR Conference,
Philadelphia, PA, July 27-31, 1997.
[IAD]
Details of the Division's program may be seen at:
http://www.itl.nist.gov/iaui/
[INTREC]
For recent work, see:
http://trec.nist.gov/pubs/trec7/papers/index.track.html#interactive
http://trec.nist.gov/pubs/trec8/papers/index.track.html#interactive
http://www-nlpir.nist.gov/projects/t7i/t7i.html
http://www-nlpir.nist.gov/projects/t8i/t8i.html
[IUSR]
I-USR Project:
http://www.nist.gov/iusr/
[Lawr99]
S. Lawrence and L. Giles,
"Accessibility of information on the web", Nature,
Vol. 400, pp. 107-109, 1999.
See http://wwwmetrics.com/.
[NIRVE]
NIRVE Home Page:
http://www.itl.nist.gov/iaui/vvrg/cugini/uicd/nirve-home.html
[QUIS]
QUIS:
http://www.cs.umd.edu/hcil/quis/
[Sebr99]
M. Sebrechts, J. Vasilakis, M. Miller, J. Cugini, S. Laskowski,
"Visualization of Search Results: A Comparative Evaluation of Text,
2D, and 3D Interfaces",
22nd International ACM SIGIR Conference on Research and
Development in Information Retrieval,
Berkeley, California, August 1999. See
http://www.itl.nist.gov/iaui/vvrg/cugini/uicd/nirve-home.html
[Shne92]
Ben Shneiderman,
"Tree visualization with tree-maps: 2-d space-filling approach",
ACM Transactions on Graphics,
11(1) (1992), pp. 92-99.
[Tebb99]
John Tebbutt, "User Evaluation of Automatically Generated Semantic
Hypertext Links in a Heavily Used Procedural Manual",
Information Processing and Management, 35(1) (1999), pp. 1-18.
See:
http://www.itl.nist.gov/iaui/894.02/works/papers/user_eval.html
[TREC]
Text Retrieval Conference:
http://trec.nist.gov
[Veer97]
Aravindan Veerasamy and Russell Heikes, "Effectiveness of a
graphical display of retrieval results",
Proceedings 20th Annual International ACM SIGIR Conference,
Philadelphia, PA, July 27-31, 1997.
[WEBM]
WebMetrics:
http://www.nist.gov/webmet/
[Wise99]
J. A. Wise, "The Ecological Approach to Text Visualization",
Journal of the American Society for Information Science,
50(13), November 1999.