Methodology

Evaluating Visual Information Browsing Displays

Emile Morse

October 6, 1998

Table of Contents

1 Introduction *

1.1 Overview *

1.2 Problem Statement *

1.3 Definitions *

1.3.1 Information Retrieval *

1.3.2 Visualization types *

1.3.3 Display types *

1.4 Hypotheses *

1.5 Limitations *

1.6 Assumptions *

2 Background Literature *

2.1 Visual Interfaces for Information Retrieval and Browsing *

2.1.1 Description of Selected Visualizations *

2.1.2 Analysis of Reference Point Visualizations *

2.2 Evaluation of Visual Interfaces for Information Retrieval and Browsing *

2.3 Task Models *

2.3.1 Domain-Dependent models *

2.3.2 Domain-Independent Visual Taxonomies *

2.3.3 Summary of Task Models *

2.4 Initial Studies *

2.4.1 2-term Boolean *

2.4.2 3-term Boolean *

3 Methodology *

3.1 Document Vector Data *

3.2 Subjects *

3.3 Mode of administration *

3.4 Independent Variables *

3.4.1 Display type *

3.4.2 Order of presentation *

3.4.3 Task type *

3.4.4 Difficulty of task setting *

3.5 Dependent Measures *

3.6 Covariates *

3.7 Statistical Analysis *

Appendix *

4 Bibliography *

List of Tables

Table 1: Reference Point Visualizations *

Table 2: Information Seeking Dimensions (Belkin et al 1995) *

Table 3: Visual Task Taxonomy (Zhou & Feiner 1998) *

Table 4: Visual Implications and related elemental tasks *

Table 5: Effect of display type on performance *

List of Figures

Figure 1: Multidimensional scaling results of Lohse (1990) showing visual display types. *

Figure 2: Examples of the Visual Elements of the Reference Point Systems *

Figure 3: Samples of presentation types: Panel A is a 'text list'; B is an 'icon list'; D is a 'table'; D is a 'graph'; and E is a 'Vibe' or 'spring' display. *

Figure 4: Effect of order of presentation of various displays on performance of OR task (Morse et al 1998) *

1 Introduction
1.1 Overview

Researchers in information retrieval (IR) have long searched for ways to make their systems more accessible to end users and to develop new ways for users to explore data. Visualization techniques (computer methods for displaying large quantities of information graphically) appear promising as a means for achieving both goals. Information visualization can make multidimensional relationships that are difficult to extract from tabular data apparent to a trained searcher. However, unlike scientific visualizations which are largely developed and used within elite specialties, IR visualizations are targeted toward guiding the public through newly accessible oceans of online information as well. Users employ many strategies when engaged in information seeking including bibliographical search, analytical search, search by analogy, empirical search, browsing, and check routine (Pejtersen 1999). Each of these activities might be augmented by using visualizations but browsing and analytical search are the strategies cited most frequently as benefiting from visual support (Marchionini 1995). According to Lin (1997), browsing is a superior strategy when

there is good underlying structure so that items close to one another can be inferred to be similar
users are unfamiliar with the contents of a collection
users have limited understanding of how a system is organized and prefer the less cognitively loaded method of exploration
users have difficulty verbalizing their underlying information need
information is easier to recognize that to describe.

Work on IR visualization systems is at a relatively early stage. Systems such as Bead (Chalmers 1996), InfoCrystal (Spoerri 1993), and LyberWorld (Hemmje 1994), have recently been developed as visual information exploration tools to aid in retrieval tasks. Researchers at the University of Pittsburgh have contributed to the development of information visualization systems with VIBE (Olsen et al 1993), GUIDO (Nuchprayoon 1996), and BIRD (Kim & Korfhage 1994).

Previous user study research into IR visualization systems has largely focused on formative usability evaluations to provide feedback for system enhancement. Newby (1992) tested subjects with the SPACE IR visualization system and standard text-based system called Prism. His findings showed that subjects preferred the Prism system for many of the tasks performed. Spoerri (1993) tested InfoCrystal’s interface with a standard Boolean interface. Koshman (1996)conducted extensive comparisons between VIBE and AskSAM, a commercially available text-based retrieval interface, using expert online searchers recruited from local libraries and novice searchers. Results showed that performance differences between AskSAM and VIBE were minimal. Novices and experts performed the test tasks with similar speed and accuracy. There was some evidence that particular tasks were better suited to performance in the graphical environment while others were more appropriate for the text-based tool.

Results of preference measures were disturbing in that users in both groups rated the textual interface slightly higher while the investigator states that she had expected a preference for the graphical interface whatever its effect on performance. She further suggests that the expanded capabilities offered by visual browsing are masked in comparisons of this sort which limit tasks to those which can be performed using text-based interfaces. For tasks expressible as Boolean queries a text-based interface may be both more direct and less complicated and less difficult to learn than a full featured VIRI.

Fully-featured visualizations are complex to evaluate for several reasons. First, choosing a control system is difficult. Second, performance is related not only to the tasks that are performed but to the mode of interaction and to the choice of feature for performance of the task. Dissecting out the critical factor is impossible. Third, the presence of many features requires considerable training so that subjects are fully capable of selecting the proper tool for a task. Fourth, tasks are normally devised to test features of the interface rather than to test tasks that occur in a natural information-seeking setting.

How could a study be designed to test visualizations in relative isolation? We first pursued the idea of ‘defeaturing’ interfaces so that users could learn the remaining core functions quickly (Morse & Lewis, 1997). Our preliminary usability evaluations demonstrated that training problems, comprehension problems, performance problems and "ratings" problems seemed to be diminished for the de-featured interfaces (Morse & Lewis 1997).

An even more rigorous de-featuring is possible -- one in which just the visualization remains. This 'back to basics' approach guides the overall development of this proposal including the preliminary studies based on Boolean data. It is bottom-up testing paradigm. If suitable sets of displays can be devised and if appropriate tasks can be developed, then it should be possible to test increasingly more complex situations and compare the results within a level of difficulty as well as across levels. Therefore, a plan has been developed to examine simple Boolean displays followed by simple vector displays. Preliminary studies have been performed in the Boolean mode that will be presented in Section 2. Both 2-term and 3-term conditions were tested in separate experiments and the results are supportive of the approach.

The next level of difficulty would be to move to a vector representation of documents. Weighted vectors underlie most modern retrieval systems. Documents in these systems are represented as elements of term occurrences. Often adjustments of various types are applied to the vectors to account for document length or other factors, but each document is characterized as a collection of numeric values. Text-based systems can use these weights to compare with user queries but few systems support users in constructing weighted queries. Visual representations can show users all the documents that contain any of the terms, which documents have which terms and information about the weighting of the terms.

1.2 Problem Statement

Information retrieval has been an interesting topic since long before computers existed. After the advent of computers, IR was among the first areas to be explored that was not based solely on the number-crunching capabilities of the machine. As computers advanced in power and in ability to present graphics, IR systems incorporated graphic features. From the mid-80's, visualizations have been developed to assist the users of interactive IR systems to satisfy their information needs. The advent of the Internet has provided additional pressure to develop efficient retrieval engines as well as effective methods for interfacing these engines with users. It is pertinent to note that users of the Internet and its IR tools form a more diverse audience than users of library-based retrieval systems. In summary, the need for powerful IR systems that can assist users with different levels of expertise and diverse interests has increased. In addition, visualizations have been incorporated into many IR systems but whether they can be understood by users or whether they can lead to more effective IR interaction is an open question.

The approach taken in this proposal to address the question of understandability of IR visualizations involves the testing of prototype displays. Initial studies have been conducted using these displays coupled with Boolean data and tasks based on Boolean combinations. These results are discussed in Section 2.4. The first study tested two-term displays and the second used three-term displays. The proposed extension involves the use of weighted vector data. Vector representations are the basis of most IR systems in use today. Vectors may be based on raw term frequencies, probabilities, or normalizations. Clearly, the Boolean tasks that were used in these preliminary studies are not suitable for testing vector document representations. Task typologies for visual display are available and will be used to develop the tasks for this set of experiments. The aim of this study is to determine whether users can understand displays based on vector representations.

Five types of displays will be tested -- word list, icon list, table, graph, and a visual display based on the VIBE placement algorithm (Olsen et al 1993). The tasks that will be applied to these displays are chosen from a set of domain-independent visual tasks. Examples of these tasks are: name, locate, describe, rank, and correlate.

1.3 Definitions

1.3.1 Information Retrieval

Information retrieval as a discipline concerns itself with the storage and access of information primarily in the form of text documents. The term information retrieval is also used in a more restrictive sense to talk about a particular kind of information seeking task, namely a combination of information access and document procurement. Other parallel activities have been variously categorized to include information organization, query formulation, query formulation and reformulation, and browsing. This list is not inclusive and evidence will be presented in Section 2.3 of the background material that will show that the types of activities that users engage in when seeking information need satisfaction may be categorized using a wide variety of schemes.

Visualizations used in information retrieval systems are usually thought to support browsing or inferencing or query reformulation. Rather than stating the list of potential activities each time that these terms might be used, this paper will often use the more general term, information retrieval.

1.3.2 Visualization types

Experimental information retrieval visualizations can be categorized in various ways. Lin (1997) suggests that there are four types—hierarchical, network, scatterplots and maps. For the purposes of this proposal a breakdown by dimensionality of the rendered data seems appropriate. In the simple visualizations proposed herein, it is most appropriate to imagine existing systems, such as VIBE (Olsen et 1993), BIRD (Kim & Korfhage 1994), GUIDO (Nuchprayoon & Korfhage 1994, Nuchprayoon 1996), InfoCrystal (Spoerri 1993), and the series of displays developed at NIST (Cugini et al 1996). All of these systems attempt to render just a few of the dimensions of the underlying hyperdimensional vector. Other systems that attempt to show relationships among much larger parts of the document vectors resemble maps. Examples of these systems are Lin’s self-organizing semantic maps (Lin 1991, 1997), SPIRE (Wise et al 1995), and BEAD (Chalmers 1996). Each of these interfaces is created by applying a dimension-reducing algorithm, such as simulated annealing, Kohonen maps, or latent semantic indexing to the full document vectors. According to Korfhage (Korfhage 1997), low-dimensional systems are properly called reference-point systems. Their primary characteristic is that a few points of interest (POIs) are used as anchor points in the display. The POIs are frequently keywords or query terms selected by the user, but may also represent full documents, user profiles or sets of terms or phrases. Map displays, on the other hand, tend to be used to represent the entire collection or document set. Some of the map systems allow the user to see sequential snapshots (SPIRE: Wise et al 1995) or to change the granularity of the detail (Lin 1997). Each of the map systems is dependent on complicated algorithms that limit their creation in an interactive environment. No doubt as computer systems become faster, the use of maps in interactive IR will increase in utility and variety.

1.3.3 Display types

A distinction is made in this paper between various types of displays. The terms that are used are text-based, word-based, tabular, graphical, and visual. A clarification of the use of these terms is useful. Text-based presentations show words in their usual semantic context. This is the usual form for text lists returned by Internet search engines. Word-based displays show each occurrence of the query term in a document listing. Tables are defined here as two-dimensional listings in which the values of the elements are numeric. Graphical displays are defined as the set of usual graph types, e.g., pie chart, bar chart, histogram, and scatterplot. Visual displays, in contrast, are composed of icons and connecting lines that do not have the normal Cartesian coordinate interpretation.

Lohse (1990) investigated how visual displays are categorized by subjects who sorted 40 display instances. The results of hierarchical clustering analysis of the data showed 5 clusters—icons, maps, diagrams, network charts, and graph and tables. Multidimensional scaling (MDS) of the same data revealed two dimensions, which Lohse (1990) named cognitive effort and discreteness of data. Both scales range from low to high. The types of displays chosen for investigation in the current proposal seem to correlate best with the tables, graphs, and diagrams. Figure 1 shows the MDS of Lohse and has been annotated to show the region covered by the proposed display types.

1.4 Hypotheses

The development of hypotheses for this study entails consideration of each of the dependent and independent variables. The dependent measures of performance are number of correct answers and time to completion of a task set, where a set refers to all the tasks for a single display type. The measure of preference is the user's rankings of each display. The independent variables are display type, order of presentation, individual task, and scenario difficulty. Scenario difficulty is defined as the number of terms depicted in a display, i.e., 2-term or 3-term. Subjects will perform the experimental tasks with a single level of difficulty.

The null form of the main hypothesis regarding display type and understandability is:

H1: Performance of tasks will not differ significantly regardless of display used.

H1a: The number of correct answers will not differ significantly regardless of display used.

H1b: Time to completion will not differ significantly regardless of display used.

Rejection of this hypothesis will indicate that there are differences with respect to understandability and usability of the displays. If the hypothesis is rejected, post-hoc testing will be performed to determine the differences. The data for both number of correct answers as well as time to completion will be subjected to this analysis.

If users perform differently with different displays, it is possible that the difference is due to the order in which the displays were presented. The null hypothesis statement is:

H2: The order of presentation of a display does not affect the correctness of answers.

H2a: The number of correct answers will not differ significantly regardless of order of the display presentation

H2b: Time to completion will not differ significantly regardless of the order of the display presentation..

If this hypothesis is rejected, it will provide evidence that users can learn how to perform tasks by using alternate display formats. Since the types of tasks being presented to users may not be the types of tasks that they are used to performing, it might be possible that there is some general trend to learning all displays. On the other hand, it might be the case that only displays that are difficult to use when presented early in the series are learned by using other displays, while some displays might be easy to use at first sight. Rejection of hypothesis H2 will allow subsequent analysis for these types of correlations.

To this point, the tasks for each display have been grouped to provide a total score. It is possible that there is a range of difficulty of tasks. The formulation of the domain-independent tasks includes parameter lists that range in number from 1 to 3. It is possible that the number of parameters is predictive of the difficulty of the tasks. The null hypothesis to support testing of this idea is:

H3: Scores for individual subtasks are not significantly different from each other.

Rejection of this hypothesis allows exploration of a secondary hypothesis:

H3a: There is no correlation between the number of parameters for an individual subtask and users' performance with the task.

Taken together, the rejection of this hypothesis allows the conclusion that test formats can be produced that can range from very easy to very difficult depending on the cardinality of the parameter list of a task.

The final hypothesis related to performance measures is to determine the effect of scenario difficulty. In its null form, the hypothesis is:

H4: The number of terms depicted in a display does not affect either time to perform a task set nor the number of correct answers.

The initial studies showed that visual displays were used less well than the other prototype displays in 2-term situations but were advantageous in the more difficult, 3-term display series. The important sub-hypothesis related to this point would try to determine whether this trend persists in the vector condition.

H4a: Each three-term display is associated with poorer performance than the paired two-term display.

The final major hypothesis of the proposed research is related to the measure of user preferences.

H5: There is no significant difference among the displays with respect to user preference.

H5a: Users express no preference for displays with which they perform better.

H5b: Users express no preference for displays based on order of presentation.

H5c: Users express similar preference rankings regardless of scenario difficulty.

1.5 Limitations

A second issue to consider is the range of display types that might be considered. In the preliminary study the displays that were tested were text-based (full-text and word lists), tabular, graphical (scatter plot), icon list, and a visual method based on the VIBE positioning algorithm. These display types are representative of two of the five types of displays described by Lohse (1990). The other types are diagrams, icons and maps. Perhaps a Venn diagram would be an appropriate instantiation of the diagram category. The ‘icon’ type does not have an obvious mapping into the IR domain. ‘Maps’ are only suitable for rendering higher dimensional data than the simple 2- and 3-term conditions being studied here. Since this study is based on a reference-point model of information retrieval and visualization, further investigation would be required to show whether other display types could be successfully tested with the proposed method.

The nature of the study might appear at first glance to be a ‘strawman’ situation since ‘visual’ tasks are being tested in a ‘visual’ environment. The argument could be made that it is only reasonable that such tasks would be performed better than non-visual tasks. However, until there exists a taxonomy of these non-visual tasks, it is not possible to compare performance across task types. In addition, there is no evidence that these ‘visual’ tasks are performed better with a ‘visual’ interface. In fact, it is possible that all tasks are performed better with full-text and that only a subset of all tasks is suitable for visual presentations.

1.6 Assumptions

Subjects recruited from the University community are suitable for the study.

2 Background Literature

Previous research pertinent to the proposed topic includes three major items. First, it is important to know what types of visualizations have been developed for information retrieval and browsing. Section 2.1 provides a discussion of the various systems. Second, the extent to which evaluation has been applied to such interfaces needs to be determined. Although there are quite a few visual interfaces for IR, there have been few user studies. Section 2.2 discusses the details of the designs of the user studies performed to date. Third, visual task taxonomies and information retrieval taxonomies will be discussed. Task models can be constructed at many levels of granularity. The library literature is replete with high-level models of user strategies. Section 2.3.1 discusses some of these. It is difficult to determine from these models what elemental tasks might be selected for testing visual interfaces. However, it is clear from scanning the categories proposed that visual tools could be applied as alternatives to text-based presentations. At a much lower level of task specification, one finds visualization-specific task typologies. These are discussed in some detail in Section 2.3.2. Finally, the results of two studies of user performance and preferences in 2-term and 3-term Boolean test scenarios are presented in section 2.4.

2.1 Visual Interfaces for Information Retrieval and Browsing

Visual interfaces for information retrieval and browsing, as discussed in Section 1.3.2, can take many forms, e.g., reference point systems, map displays, 3-dimensional systems. The focus of this proposal is on low-dimensional reference point systems. There are many map-type visualizations including SPIRE (Wise et al 1995), Themescape (Wise et al 1995), self-organizing maps (Lin 1991, 1997), and BEAD (Chalmers 1996). These systems are relevant in the overall context of visualization in information but are outside the focus of this paper. The suggestion could be made that maps are quantitatively but not qualitatively different from reference point systems. This study, however, does not depend on the validity of this suggestion. Also excluded from review, although closely allied, are visualizations that are based on data derived from non-document sources, e.g., databases. Also missing from this analysis are 3-dimensional systems such as LyberWorld (Hemmje et al 1994), VR-VIBE (Benford et al 1995) and the set of systems developed at NIST and reported by Cugini et al (1996). The increased difficulty of interpreting these systems is outside the scope of the simple approach proposed here. Future extension to such systems would need to be evaluated before applying the method.

Table 1 shows a list of pertinent low-dimensional document visualization systems. Each of these displays relies on representation of documents as vectors, although systems such as InfoCrystal contain Boolean vectors. The purpose of exploring these interfaces in the current context is to determine whether there are unifying features or common rendering techniques that map to the prototype displays planned for evaluation in this study. Such an analysis should provide an indication of the utility of the intended approach and possibility for its scalability.

Table 1: Reference Point Visualizations

Visualization System	Reference	Tested?
Component Scale Drawing	Crouch & Korfhage 1990	yes
Cougar	Hearst 1994	no
GUIDO	Nuchprayoon 1996	yes
InfoCrystal	Spoerri 1993	no
SIRRA	Aalbersberg 1995	no
Space	Newby 1992	yes
TileBars	Hearst 1995	yes
VIBE	Olsen et al 1992	yes
WebVIBE	Morse & Lewis 1997	yes

2.1.1 Description of Selected Visualizations

Component Scale Drawing (Crouch & Korfhage 1990) is shown in Figure 2a. The graph shows query terms on the x-axis; the order of the terms is determined by the user's weighting. The y-axis indicates classes of term weights. Documents are represented as broken lines and the query itself is represented as a solid line. The purpose of the system is to assist users in determining the similarity of query and documents.

Cougar (Hearst 1994), shown in Figure 2b, uses a Venn diagram to represent the relationship between documents and query terms. Each query term is mapped to a circular area of the display. Document identifiers are shown as icons in the list box in the appropriate sector of the graph.

GUIDO (Nuchprayoon 1996) uses a novel type of display to allow sophisticated mathematical manipulation of similarity metrics (Figure 2c). The display allows the selection of two reference points which are then shown a points on the x- and y-axes. The resulting document set is then displayed in the plank that is generated at a 45˚ angle in the graph. Various retrieval caps and metrics are provided to enhance selection of desirable subsets of documents.

InfoCrystal (Spoerri 1993) is another example of a reference point system that is based on the Venn diagram model. Figure 2d shows the results of a 4-term Boolean query. The query terms are indicated at the vertices and the resultant subsets are associated with the other shapes shown in the bounding box. The number of edges of an included shape indicates the number of query terms and the direction of the vertex points to the query term. InfoCrystal is useful for determining the distribution of documents in a document collection that satisfy each of the possible Boolean queries. Spoerri describes higher dimensional InfoCrystals. He also illustrates a version that allows the creation of weighted vector queries, although the display looks tremendously complex.

An example of SIRRA (Aalbersberg 1995) is shown in Figure 2e. The visual interface incorporates a list of multicolor icons. Each query term is assigned a color and each icon represents a single document. Users can compare documents with respect to the strength of a query term within and across documents in a set.

Space (Newby 1992) is an IR system based on the principles of navigation which he defines as human behavior to make sense of an information space. The example of Space shown in Figure 2f is a part of the interface termed the 'Navigation window. Keywords and document identifiers float in the field. The placement of the documents with respect to the key terms is based on the relative strength of attraction of the document for the term. Other windows in the interface include a map view and a key term list.

TileBars (Hearst 1992) is based on segmentation of the underlying full-text into topics. Figure 2g shows the results of a 3-term query. Each large box represents a single document and the grayscale rectangular areas with them show the relative amount of the term in sequential fragments of the text. Each row within the large rectangle represents a query term or combination of terms. This visual method has been incorporated into the Scatter/Gather interface.

VIBE (Olsen et al 1992) is shown in Figure 2h. Query terms are shown at the vertices of the figure and the resulting retrieved set of documents are shown as icons scattered around the enclosing triangle. VIBE can be used to represent the results of Boolean as well as vector queries. This visual is part of a fully featured interface that allows users to interact with term lists and moderately large document collections.

WebVIBE (Morse & Lewis 1997) is an obvious relative of VIBE (Figure 2i). An attempt was made to reduce the number of features which was hypothesized to affect the ability of people to use the interface effectively. Since the information source is not local, the vector representation of the document content must be determined 'on-the-fly'. The naive physics metaphor of magnetism was employed to encourage interaction and learnability.

2.1.2 Analysis of Reference Point Visualizations

Several types of visual approaches can be seen when the above examples are analyzed. Three major categories are: Venn diagram, icon lists, spatial systems. Both Cougar (Hearst 1994) and InfoCrystal (Spoerri 1993) are based on the Venn diagram. SIRRA (Aalbersberg 1995) and TileBars (Hearst 1992) present icon lists. VIBE (Olsen et al 1992), WebVIBE (Morse & Lewis 1997) and Space (Newby 1992) employ a spatial method to render the relationship of documents and key terms. Although GUIDO (Nuchprayoon 1996) and Component Scale Drawing (Crouch & Korfhage 1990) are both graphical representations that show an x-y graph, there does not appear to be much that makes them similar. The latter uses a line graph and nominal scales while the former uses icons and continuous scales.

The prototypes described for testing in this proposal contain representatives of the icon list and spatial types of displays. The Venn diagram appears to be useful for displaying Boolean data but falls short of making compelling displays of vector data. The x-y graph type that is to be used in the 2-term prototype testing is clearly unrelated to either GUIDO or Component Scale Drawing.

2.2 Evaluation of Visual Interfaces for Information Retrieval and Browsing

Of the reference-point visualizations discussed in the previous section, only Component Scale Drawing, GUIDO, Space, TileBars and VIBE have been subjected to user studies. Each of the studies will be reviewed briefly here. The purpose in reviewing these evaluation methods is to determine what tasks were given to the subjects and also to determine which interfaces were subjected to usability evaluation as opposed to task performance evaluations. Other pertinent aspects of the studies, such as number of subjects used and characteristics of the user/subject populations, will be noted where such information exists.

Component Scale Drawing (Crouch & Korfhage 1990) was tested by presenting a user with a display based on each of 15 queries. The task of the user was to use the CSD tools to rank the documents with respect to their similarity to the underlying query. The rankings were then compared with the known relevance rankings. The results showed that there was a highly significant relationship between the user’s rankings and the known rankings (Spearman coefficient 0.85 across queries). The number of users is not clear from the paper but may limited to a single person. The task is clearly highly specific to the interface.

GUIDO (Nuchprayoon 1996) was subjected to usability testing. Sixteen subjects were charged with performing nine information retrieval tasks. Tasks were graded as 'easy' and 'hard'; 'easy' tasks presented the test subject with pre-selected metrics, retrieval threshold, and POIs, while 'hard' tasks required the subject to select each on his own. The primary goal of each task was to choose the 8 'best' documents from the resultant display. The primary measure was the amount of time that it took the subjects to perform the document selection. The results showed that there were some interactions among the retrieval threshold and metric. Subjects provided positive feedback on the GUIDO system.

Newby (1992) tested Space with 20 users. They were provided with a full system display which included the multi-window display and a mouse and PowerGlove. His primary goal was to test the ability of users to navigate abstract spaces. Users performed two information retrieval tasks: 1) a closed-ended question that was based on key-term synonymy and 2) an open-ended task based on a vague statement of information need. The ‘Space’ system was compared with a traditional IR system (Prism). Newby demonstrated considerable learnability of the Space system and high user ratings. Comparison with the more traditional system showed that users preferred what they were already familiar with.

TileBars (Hearst 1995) has not been subjected to the same type of user studies mentioned thus far. The TileBars interface itself has not been user tested but the algorithm underlying its segmentation of text into topics has been compared with human segmentation of the same text. High correlations were found between the two types of segment generators. This study, however, is not useful for the purposes of the proposed work.

VIBE has been subjected to user testing by Koshman (1996). She compared performance of expert and novice searchers using VIBE or a conventional text-based interface (AskSAM). There were 15 novices, 12 online search experts and 4 subjects who had VIBE system expertise. Due to the small sample of VIBE experts, the study concentrates on the former 2 groups. This was a thorough usability study of the VIBE interface in that it sought to measure user’s performance at tasks that required use of novel interface features. Subjects performed 7 tasks that were chosen for their likelihood to represent ‘normal’ user IR tasks. The tasks were structured to cover a variety of ‘information tasks’ as opposed to ‘navigation tasks’ since many of the latter tasks could not be realized in the VIBE interface (p58). In general, the tasks have a Boolean flavor, e.g., how many documents contain (all, one, or two) terms. Scenarios were constructed to provide a naturalistic information seeking setting. Usability was assessed by measuring: 1) system familiarity time, 2)-task performance speed, 3) frequency with which online help is accessed, 4) Number of errors in task results, 5) subjective satisfaction, and 6) system feature retention. Familiarity time showed no difference for interface nor for expertise level. She showed that time to complete tasks was inversely related to expertise. Users preferred the familiar, text-based interface to the visual VIBE interface. She states that users retained what they learned from one session to the next but believes that this was due to increased ‘familiarity with the kinds of task and the tools need to perform the tasks’. It is reasonable to conclude that the Boolean nature of the tasks chosen for this study influenced the outcome, in that Boolean tasks are probably accomplished more effectively with Boolean systems such as AskSAM.

WebVIBE was subjected to usability testing (Morse & Lewis 1997). The overall aim of these studies was to determine whether defeaturing existing IR interfaces could produce interfaces which could be used successfully in 'walk-up' systems, especially on the Web. The results showed that users could indeed form correct inferences about retrieved documents and their relationship to the query terms without extensive training.

2.3 Task Models

Several frameworks for information visualization have been proposed (Kennedy et al. 1996, Rogowitz & Treinish 1993, Wehrend & Lewis 1990). Some of these structures include modeling of the user. Increasingly, user-centered design is being adopted. In this paradigm, explicit representation of the user is important. The user can be modeled in the system by assessing the user’s goals and/or defining the tasks the user needs to perform.

This section will present several task models, some of which are domain-dependent and others which are independent of domain. The granularity of analysis runs the gamut from very fine-grained to very high level.

A classification scheme supports the development of task sets for system evaluation and lays the groundwork for the development of automatic visualization systems. By knowing the data that exists, the requirements of the interface and the goals of the user, it becomes possible to ask how one might build visualizations automatically. The purpose of this paper is to discuss the issues that contribute to understanding how best to approach the evaluation of document visualization systems.

2.3.1 Domain-Dependent models

Modeling users in information retrieval situations has a long history in library science. Systems have changed from having only titles and minimal other metadata to having abstracts to the present situation in which most texts are available as full-texts. Systems have increased in capacity to accommodate the requirements of full-text storage and systems have taken advantage of increased computing power to perform searches. Where once an intermediary worked with a user to formulate a query which would be submitted in essentially batch mode, many current systems are used by the end user and searches are interactive. Only recently have visualizations been developed that might help satisfy some of the user’s information needs. The models developed by library scientists have changed to accommodate evolving resources.

The following task models developed for use in library environments were chosen to show how varied the approaches are and to describe some models that might actually have some utility in evaluating visual interfaces. Reviews of the historical evolution of information retrieval can be found in Spink (1997) and Bates (1989).

2.2.1.1 Marchionini

The breakdown of the information seeking provided by Marchionini (1992) describes a network of tasks that are performed in various, user-defined orders until the information-seeking problem is solved. Marchionini states clearly that there are two basic forms of information needs – fact knowledge and browsing. The subtasks that he provides are not different for the two types of needs and are:

Define the problem – this is the required first step in the process. As problem-solving proceeds, the problem will undergo a series of revisions.
Select the source – a user must choose an entry point for a search, e.g., the Web via a particular engine, a library catalog, or an on-line periodical server.
Articulate the problem – by articulation he means to form a query that can be processed by the system.
Examine the results – the user must review the items returned in response to a query in order to determine whether useful information was retrieved.
Extract information – when interesting and/or useful items are found, the user must able to acquire a physical or electronic copy of the material.

This particular task list is relevant to interface design but provides little guidance on what subtasks might be. It is also highly grounded in the traditional information retrieval paradigm in that it relies on query formation and an iterative performance of steps to arrive at a satisfactory solution.

2.3.1.2 Bates

Bates (1989) describes a ‘berrypicking’ model of information retrieval, which she contrasts with the classical method. Her description of browsing in a world of text seems to offer similarities to visual representations. She presents a list of 6 tasks:

Footnote chasing – a ‘backward chaining’ method, which enables a user to find material which preceded it in publication time.
Citation searching—a ‘forward chaining’ method which allow users to find other papers that cite the same reference material.
Journal run – once a user finds a ‘good’ journal, he will scan entire issues and even volumes. Precision with the core journals in a field is very high with this method.
Area scanning – in real libraries, books having the same topic codes are stored in close proximity.
Users frequently start with a single reference and expand their search by literally browsing the stacks.
Subject search in bibliographies and abstracting and indexing service – strategy based on the commonly available indexes.
Author searching – self-explanatory.

2.3.1.3 Belkin

Belkin et al. (1995) propose that information seeking can be defined with respect to four dimensions as shown in the following table.

Searching as a method of interaction refers to trying to find some known item, while scanning refers to trying to find something interesting. The goal of interaction might be to learn something about an item or it might be to select the item. When looking for items, the user might specify what should be looked for or he might find it by recognizing it. The distinction between information and meta-information is the same distinction that has been made in this paper.

Belkin (Belkin et al. 1995) notes that there are 16 possible information-seeking strategies if each of the components is viewed as a Boolean value. For instance, traditional information retrieval might be characterized as Selecting + Specification + Meta-information + any method of interaction and information visualization would be described as Learning + Recognition + Information + any method of interaction but frequently Scanning.

2.3.1.4 VIRI Research Group Tasks

The VIRI (Visual Information Retrieval Interface) research group developed a set of tasks that we term ‘tool-enabled’ tasks. The idea behind the name is that visualizations provide ways of doing things that might not be possible or that might be much more difficult using less visual means. The list is as follows:

Choice among data sets—by this we mean that the user can be shown a map of collections that can guide his further search. This is a task that is synonymous with Marchionini’s ‘Select a source’ task (1992).
Find without peeking—this task is motivated by costs associated with retrieving information sources. If a visualization were to provide powerful tools for restricting a set of possible sources, then the costs of transferring the data could be reduced.
Simultaneous criteria—this task derives from the fact that the same data viewed from a different perspective would be expected to give a different view and therefore would support different inferences. For example, a proposal might be interesting if viewed by economic, political, or other domain-attribute.
Massive retrieval with punishment for extraneous items—this item has much in common with ‘find without peeking.’ In ‘find without peeking’ the goal is to locate a single document, while this ‘massive retrieval without punishment' assumes that many items will be retrieved. It focuses the task on query formation, albeit implicit querying.
Pick best 10 -- this is such a common task that each of the task models discussed includes it in some form.
Sort by term count for retrieved—this is a ranking task.

2.3.2 Domain-Independent Visual Taxonomies
2.3.2.1 Wehrend & Lewis

The task classification of Wehrend & Lewis (1990) is a low-level, domain-independent taxonomy of tasks that users might perform in a visual environment. Domain-independence allows generalizability. Wehrend &; Lewis’ classification consists of the following set of user actions.

Locate: This action can be applied to dependent as well as to independent variables. It covers interaction techniques that allow the user to find special data entries. Annotation techniques are covered by this action, for example an arrow marking the most interesting point of the display. Locate can also work like a filter, e.g., by highlighting data items that lie in a special range. Locate includes search for an object that the user already knows about.
Identify: Identify is similar to Locate, but in this case the user is being asked to describe an object that was not necessarily known previously.
Distinguish: This action allows distinguishing between different values of the same variable, e.g., for a user to know which objects have already been identified or interacted with. The interface might show different iconic representations for each object type.
Categorize: Categorizing means to define divisions that displayed objects can be sorted by. Examples of VIBE (Olsen 1993) tasks that are categorizations are 1) To define all the regions of a 3-POI (point of interest) Boolean display, 2) Draw boundaries in a vector VIBE 3-POI display for each of the possible Boolean combinations of terms.
Cluster: The cluster task covers techniques that allow us to determine whether data entries are clustered or not. The ambiguity introduced by flattening the hyperdimensional spaces into two dimensions would be probed by this activity. It includes finding gaps in the display field (cluster of nothing).
Distribution: The distribution action is closely related to cluster in much the same way that locate and identify are related. To distribute, the user needs to describe the overall pattern while cluster merely asks that the set be detected.
Rank: Ranking is only possible for scalar and ordinal data. Users could be asked to indicate the best and worst cases in a display. Since nominal data cannot be ranked, it is important that displays of nominal data be designed so that the user is discouraged from trying to perform such actions.
Compare within entities: This action describes tasks in which a user is called upon to decide something based on the attributes of similar objects.
Compare between relations: When different entities are used as the basis of comparison, the ‘compare between relations’ operator is used. For instance, if a set of objects has been marked as seen and the remainder of the set is unseen, then the user might compare and contrast attributes of the sets.
Associate: The associate action calls upon the user to form relationships between objects in a display.
Correlate: If objects in a display have multiple attributes, then it should be possible to discern which other objects share attributes. For instance, in a scatterplot in which the marks have shape and color as well as their x and y position, the objects should be groupable by any of the attributes.

2.3.2.2 Zhou & Feiner

A visual task taxonomy has been developed by Zhou & Feiner (1998). This taxonomy extends that of Wehrend & Lewis (1990) by defining additional tasks, by parameterizing the tasks, and by developing a set of dimensions by which the tasks can be grouped. Table 3 shows a list of the elemental visual tasks together with task parameter list (shown in angle brackets).

Table 3: Visual Task Taxonomy (Zhou & Feiner 1998)

Associate<?x, ?y>	Correlate<?x1,..,?xn >	Locate<?x, ?locator>	Encode<?x>
Collocate<?x, ?y>	Plot<?x1,..,?xn >	Position<?x, ?locator >	Label<?x>
Connect<?x, ?y>	MarkCompose<?x1,..,?xn>	Situate<?x, ?locator >	Symbolize<?x>
Unite<?x,?x-part>	Distinguish<?x, ?y>	Pinpoint<?x, ?locator >	Quantify<?x>
Attach<?x,?x-part >	MarkDistribute<?x, ?y>	Outline<?x, ?locator >	Iconify<?x>
Background<?x, ?bkg>	Isolate<?x, ?y>	Rank<?x1,..,?xn ,?attr>	Portray<?x>
Categorize<?x1,..,?xn>	Emphasize<?x,?x-part >	Time<?x1,..,?xn >	Tabulate<?x>
Mark<?x1,..,?xn>	Focus<?x,?x-part >	Reveal<?x,?x-part >	Plot<?x>
Cluster<?cluster,..,?xn>	Isolate<?x,?x-part >	Expose<?x,?x-part >	Structure<?x>
Outline<?cluster>	Reinforce<?x,?x-part >	Itemize<?x,?x-part >	Trace<?x>
Individualize<?cluster>	Generalize<?x1,..,?xn >	Specify<?x,?x-part >	Map<?x>
Compare<?x, ?y>	Merge<?x1,..,?xn >	Separate<?x,?x-part >
Differentiate<?x, ?y>	Identify<?x, ?identifier>	Switch<?x, ?y>
Intersect<?x, ?y>	Name<?x, ?name>
	Portray<?x, ?image>
	Individualize<?x, ?attr>
	Profile<?x, ?profile>

The major dimensions of visual tasks that they describe are visual accomplishments and visual implications:

"Visual accomplishments describe the type of presentation intents that a visual might help to achieve, while visual implications specify a particular type of visual action that a visual task may carry out." -- Zhou & Feiner (1998)

The structure that results from applying the visual accomplishments dimension is a hierarchy. The major branches describe tasks that ‘Enable’ and tasks that ‘Inform’. The former are further decomposed into exploration tasks and compute tasks, while the later are described as elaborate and summarize tasks. The breakdown along the line of visual implications seems that it might be useful in developing domain-dependent tasks. Zhou & Feiner propose three types of implications: 1. visual organization, 2. visual signaling, and 3. visual transformations. The overall structure of the implications dimension of the visual taxonomy is shown in Table 4.

Table 4: Visual Implications and related elemental tasks

Implication	Type	Subtype	Elemental tasks
Organization	Visual grouping	Proximity	associate, cluster, locate
		Similarity	categorize, cluster, distinguish
		Continuity	associate, locate, reveal
		Closure	cluster, locate, outline
	Visual attention		cluster, distinguish, emphasize, locate
	Visual sequence		emphasize, identify, rank
	Visual composition		associate, correlate, identify, reveal
Signaling	Structuring		tabulate, plot, structure, trace, map
	Encoding		label, symbolize, portray, quantify
Transformation	Modification		emphasize, generalize, reveal
	Transition		switch

2.3.3 Summary of Task Models

Each of the task models presented above presents a different view of the problem of evaluating visual information retrieval systems. Domain-dependent models are very high-level abstractions of user tasks. While it would be desirable to employ such a model, there is no obvious method for selecting possible types of tasks. The fact that the scales of the various typologies are presented at several different levels of granularity adds to the difficulty of determining which of the domain-dependent models to use. In addition, the studies done with library patrons focus on the tasks that the users seek to accomplish are perhaps learned behaviors due to their prior knowledge of how libraries work—they tend to ask questions that they know can be answered. Visualizations might support a different way of asking questions and getting answers. The task list developed by the VIRI Research group is not structured as a typology at all. Even the more structured typologies do not answer the question of what fraction of an information browsing user's needs are contained in each taxonomic category. The difficulty in using the domain-independent models is the number of possible mappings to a particular domain.

2.4 Initial Studies

2.4.1 2-term Boolean

The first step in the proposed bottom-up testing plan involved presenting a variety of simple interfaces to groups of undergraduate students. The students were registered in programs at the University of Pittsburgh or Molde College, Norway. The simple interfaces are shown in Figure 3; they are labeled text, icon-list, table, graph and spring. The last of these is based on the VIBE display model (Olsen et 1992).

Subjects took the test as a paper-and-pencil exercise during a normal class session. Instructional materials were limited to a short presentation of the displays using dummy data. The order of presenting the displays was randomized. The total randomization entailed 120 different orderings. For each display the subjects were asked two questions: 1. Circle the item(s) that contain term X and Y, and 2) How many items contain the term X. After answering all the questions for each display the subjects were asked to rank the displays according to their personal preference. Demographic information was also collected at this time.

Two hundred sixteen (216) subjects took part in the study. There were 121 men and 95 women. Seventy-five students were from Norway and the remainder took the test in Pittsburgh. Although the test was administered as part of the requirements of an undergraduate class, there were 9 graduate students enrolled in these classes. There were 71 freshman, 19 sophomores, 43 juniors, and 72 seniors in the sample. One hundred eighteen students were under 23 years of age; 71 were between 23 and 30 and the remaining 21 were over 30. Of the students in the Pittsburgh sample, 29 reported that English was not their native language.

The performance of subjects was similar with respect to gender, age, and year in program. Initial analysis of the data showed a significantly better performance by subjects in the 'Norwegian' group compared with the 'American' group. However, all of the variation could be explained by the high number of subjects in the latter sample that spoke a language other that English as their native tongue. When native language was factored into the analysis, the discrepant performance was abolished.

Performance was affected by question type, display type, and order of presentation. The first question (‘Circle items about terms X and Y’) was answered correctly more often that the OR question regardless of display type. This finding indicates that attention must be paid to the construction of probe questions for interface testing. Average performance for each display type showed that the text and icon lists were easiest to comprehend, followed by the table and finally the graph and spring displays were most difficult. The order in which the displays were presented was also a significant predictor of successful performance. Figure 4 shows the results of this interaction. It shows that displays such as the spring, which were difficult to understand and use if presented first, were easier to use just because of the practice with answering similar questions during other display trials.

The subjects’ preferences showed that performance was not the best predictor of preference. Fully 47% of subject rated the text display as their least favorite despite superior performance with that display. Over 60% of the respondents indicated a preference for the visual displays, i.e., the icon list and spring display. It is possible that there was a Hawthorne effect, since the subjects might easily have assumed that the investigators were testing the visual displays and might be looking to demonstrate their superiority. Users prefer visual displays even if these displays do not always provide the best environments for task performance.

This study serves as a baseline against which more complicated IR designs can be tested. It shows that subjects can learn novel interfaces with a minimum of effort, that they prefer visual interfaces and that there is sufficient sensitivity of the measures employed in this study to make clear observations of differences in both performance and preference. The results of this study will be published soon (Morse et al 1998).

2.4.2 3-term Boolean

The 3-term Boolean experiment has been administered to 70 students enrolled in undergraduate Information Science courses at the University of Pittsburgh. The full test is included in the Appendix. Some subjects took the test as a paper-and-pencil exercise during a normal class period; other students volunteered or were assigned to provide answers using a computer. These students were provided with the URL and a deadline (usually 1-week) by which the experiment was to be completed. Seven students were recruited for videotaping, audiotaping and debriefing.

The results show that there was no significant difference between computer-mediated administration and paper and pencil. This finding is supportive of the approach detailed later in the methodology of this proposal which relies solely on computer delivery of tests. Thirty-one subjects performed the experiment in the computer-mediated mode. Timing data (Table 5) for these subjects showed highly significant differences among the groups when analyzed with a repeated-measures ANOVA.

Table 5: Effect of display type on performance

Display Type	Time to complete all questions (min; mean + SE)
Word List	2.8 + 0.3
Icon List	2.4 + 0.3
Table	2.1 + 0.2
VIBE-like	1.9 + 0.2

The primary hypothesis that was being tested in this experiment was the enhanced difficulty of the setting (2-term vs. 3 term Boolean) would show a superior performance with visual displays. This immunity to performance decay would be accompanied by an increased preference of subjects for the visual displays. Analysis of the results is still in its preliminary stage but it is clear that subjects performed least well with the ‘text’ and best with the icon-based display. The spring/vibe display was intermediate. Preference data have not yet been analyzed thoroughly but initial results confirm those of the 2-term tests—users prefer visual displays (‘icon’ and ‘vibe’) and dislike ‘text’ displays.

3 Methodology
3.1 Document Vector Data

The data for these studies will be vector-based representations of documents. The values in the vector elements will be based on raw frequency of terms in individual documents. The source of the documents will be the AP newswire collection of the TREC document set. Sets of term pairs and term triplets will be selected from the high occurrence terms in the collection.

3.2 Subjects

Subjects will be recruited from the general University community of the University of Pittsburgh and other local colleges. Based on the results of a power analysis using the timing data gathered in the preliminary studies, 50 subjects will be sufficient to accomplish the goals of the study. However, in order to insure against dropouts, outliers, incomplete experiments, and network failures, and in order to present all the possible orderings of 4 (3-term) and 5 (2-term) test displays, larger numbers of subjects will be entered into the study. There are 20 possible orderings of 4 things taken 4 at a time; 60 subjects will be used for the 3-term vector study. For the 2-term condition, five different displays will be compared leading to a requirement for 120 subjects.

Approximately 20 subjects will be inducted into the laboratory testing, ten for the 2-term condition and 10 for the 3-term condition. These subjects will receive $12 for participating in the 1-1.5 hour study. These subjects will be tested before the rest of their respective group. Their performance will serve to refine the wording of the test and to clarify other ambiguities. The group as a whole will be monitored for errors that appear to be due to conditions of the test not due to the variables being tested, such as line spacing too small so that counting errors occur with high frequency.

The initial studies indicated a confounding effect of native language. Approximately 20% of the American sample considered a language other than English as their native tongue. No effort will be made to control for native language in this study, which might allow for replication of the original finding. If a subject indicates that English is not his native language, then an additional subject will be recruited so that the final sample size will be based on number of English-speakers. .

3.3 Mode of administration

Both 2-term and 3-term vector tests will be presented to users using a computer. The test materials will be composed of HTML pages with embedded, client-side Javascript to control for completeness. Each answer will be a required element and a new page will not be delivered until each blank if filled in.

3.4 Independent Variables

3.4.1 Display type

The displays that will be tested are two visual displays (icon list and VIBE-like), one numeric display (table) and one word-based display. In addition, in the 2-term condition, a scatterplot will be tested. A 3-term scatterplot would require a three-dimensional rendering and such displays are outside the scope of this proposal. In a Boolean system, the table and VIBE displays present summary data and the word and icon lists present an element for each item in the collection. In the vector situation, however, few of the document vectors are identical leading to a situation in which the table grows to approximate the icon list or word list in size. Similarly, the VIBE display of 3 Boolean terms produces a maximum of 7 points, while the vector version could produce a number of points equal to the number of documents containing at least one of the terms.

It is pertinent to note that the initial Boolean studies presented users with interfaces that were either lists (words or icons) or summaries (tables, graphs, VIBE-like). In order to arrive at a correct answer with a summary display, the user had to add numbers. The list displays showed ordered data and a subject could generate subtotals and then proceed to add the subtotals to derive the final answer. A more common approach to list data is to merely count each element that satisfies the criteria. Subjects who were videotaped while performing the 3-term Boolean test were observed to use the counting method for lists. Differences in performance were observed which indicated that the counting task was easier. Considering that any strategy that maps vector data will yield interfaces that require counting rather than adding, it will be interesting to observe if the task that was previously identified as being easier will add to the ease of use of the interfaces.

The displays will contain approximately 10-15 documents per POI. This compares favorably with the number used in the initial Boolean study. The laboratory pre-test will include a debriefing item to assess the adequacy of this choice. If subjects indicate that the number is too few or too great for all of the displays, the number will be adjusted.

3.4.2 Order of presentation

The order of presentation of the interfaces will be randomized. In order to enable detection of learning, all tasks with a single interface will be presented as a block before showing the user a different display type. The tasks will be presented all at once so that the user may work on them in any order. This will also allow the users to correct answers that they discover during the answering of later questions.

3.4.3 Task type

Tasks will be developed from the parameterized taxonomy described by Zhou & Feiner and described in Section 2.3.2.2. Ten to fifteen of these elemental task types will be mapped to specific tasks from the domain of information retrieval, browsing and organization. For example, the Locate (?x, ?locator) task might be probed by asking the subject to find a document identified with a single reference point. Some of the tasks from the taxonomy as proposed by Zhou & Feiner (1998) take several parameters. It is expected that tasks that take more parameters will be more difficult. The types of actual questions that may be posed will take several forms. Multiple-choice, true/false, short answer will be used. However, a particular task will always be probed by the same format question across displays.

3.4.4 Difficulty of task setting

The set of vector tasks is more difficult that those posed in the Boolean study. They ask the user to make finer distinctions among sets of documents even though the number of POIs is the same. The plan for this proposed study is to present different groups of subjects with 2-term and 3-term vector display sets.

3.5 Dependent Measures

1. Performance on each of the subtasks will be measured by the correctness of the answers.

2. Total time to complete all of the subtasks with a single display will be used as an indicator of overall ease of using the display.

3. Preference for displays will be recorded as a ranking of the displays.

3.6 Covariates

The following data will be collected to determine whether there is any influence of them on the primary measures described above. The first four are demographic data that showed no effects in the preliminary study (age, computer experience, self-assessment of computer expertise, and year in college program). Information about native language, a confounding factor in the preliminary study will be gathered to determine whether this effect will be replicated.

Three factors related to the computer equipment used for the testing will be gathered (machine speed, monitor size and modem speed). Usability studies conducted in our laboratory have shown correlations between computer environment and user’s satisfaction with a test interface (Morse: unpublished results). That experiment, however, was based on a Java applet that required a long time to load. The current test plan requires only the delivery of simple HTML pages. Therefore, it is anticipated that these factors will not be problematic in this setting.

3.7 Statistical Analysis

The initial studies indicated that neither performance nor preference data was normally distributed. Therefore, non-parametric statistical methods will be employed for the analysis of the data. T iming data, on the other hand, have been shown in the 3-term Boolean study to suitable for parametric methods.

Appendix {Examples of 2- and 3-term Boolean paper-and-pencil tests}

4 Bibliography

Aalbersberg, I.J. 1995. Personal communication in Nuchprayoon (1996)

Bates M. 1989. A ‘berrypicking’ model of information retrieval. Online Review 13(5):408-424

Belkin, N.J., C. Cool, A. Stein, and U. Thiel. 1995. Cases, scripts, and information-seeking strategies: on the design of interactive information retrieval systems. Expert Systems with Applications, 9:

Benford, S.D., D. Snowdon, C. Greenhalgh, R. Ingram, I. Knox and C. Brown. 1995. VR-VIBE: a virtual environment for co-operative information retrieval. Eurographics ‘95, 30^th August - 1^st September, Maastricht, The Netherlands, 349-360.

Chalmers, M. 1996. A linear iteration time layout algorithm for visualising high-dimensional data. Proceedings of IEEE Visualization ‘96, 127-132

Crouch, D.B. and R.R. Korfhage. 1990. The use of visual representations in information retrieval applications. In T. Ichikawa, E. Jungert, & R.R. Korfhage (Eds.), Visual Languages and Applications, pp. 305-326, New York: Plenum

Cugini, J., C. Piatko, and S. Laskowski. 1996. Interactive 3D visualization for document retrieval. NIST publication.

Cutting, D.R., D.R. Karger, J.P. Pedersen and J.W. Tukey. 1992. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the fifteenth annual international ACM SIGIR Conference on Research and Development in Information Retrieval (Copenhagen) 318-329

Hearst, M.A. 1994. Using categories to provide context for full-text retrieval results. In Proceedings of the RIAO ' 94, New York.

Hearst, M.A. 1995. TileBars: visualization of term distribution information in full text information access. CHI ’95 Proceedings, 213-220

Hemmje, M., C. Kunkel, and A. Willett. 1994. LyberWorld—a visualization user interface supporting fulltext retrieval. SIGIR ‘94, 249-259

Kennedy, J.B., K.J. Mitchell and P.J. Barclay. 1996. A framework for information visualisation. SIGMOD Record 25(4):30-34

Kim, H. & Korfhage, R. R. (1994). BIRD: Browsing Interface for the Retrieval of Documents. In Proceedings of the 1994 IEEE Symposium on Visual Languages, St. Louis, 176-177.

Korfhage, R.R. 1997. Information Storage and Retrieval. John Wiley & Sons, New York. pp. 349

Koshman, S. 1996. Usability testing of a prototype visualization-based information retrieval system. Dissertation, University of Pittsburgh.

Lin, X. 1997. Map displays for information retrieval. JASIS 48(1):40-54

Lin, X., D. Soergel, and G. Marchionini. 1991. A self-organizing semantic map for information retrieval. SIGIR ‘91, 262-269.

Lohse, G., H. Rueter, K. Biolsi, and N. Walker. 1990. Classifying visual knowledge representations: a foundation for visualization research. Visualization '90: Proceedings of the First Conference on Visualization, 131-138

Mackinlay, J.D., R. Rao, and S.K. Card. 1995. An organic user interface for searching citation links. CHI ‘95, 67-73

Marchionini, G. 1992. Interfaces for end-user information seeking. JASIS 43(2):156-163

Marchionini, G. 1995. Information Seeking in Electronic Environments. New York: Cambridge University Press.

Morse, E., and M. Lewis. 1997. Why information visualizations sometimes fail. Proceedings of IEEE International Conference on Systems Man and Cybernetics, Orlando, FL, October 12-15, 1997.

Morse, E., M. Lewis, R.R.Korfhage, and K. Olsen. 1998. Evaluation of text, numeric and graphical presentations for information retrieval interfaces: User preference and task performance measures .Proceedings of IEEE International Conference on Systems Man and Cybernetics, San Diego, CA, October 11-14, 1998.

Newby, G.B. 1992. An investigation of the role of navigation for information retrieval. Proceedings of ASIS ‘92, 20-25

Nuchprayoon, A (1996). GUIDO: A Usability Study of its basic retrieval operations. Doctoral Dissertation. School of Information Sciences, University of Pittsburgh.

Nuchprayoon, A. & Korfhage, R. R. (1994). GUIDO, A Visual Tool for Retrieving Documents. In Proceedings 1994 IEEE Computer Society Workshop on Visual Languages, St. Louis, 64-71.

Olsen, K.A., R.R. Korfhage, M.B. Spring, K.M. Sochats, and J.G. Williams. 1993. Visualization of a document collection: The VIBE system. Information Processing and Management. 29(1): 69-81.

Pejtersen, A.M. 1988. Search strategies and database design for information retrieval in libraries. In. L.P. Goodstein, H.B. Andersen & S.E. Olsen (Eds.), Tasks, Errors and Mental Models, Hampshire, England: Taylor & Francis, pp. 171-192

Pirolli, P., P. Schank, M. Hearst, and C. Diehl. 1996. Scatter/Gather browsing communicates the topic structure of a very large text collection. CHI ‘96, 213-220

Rogowitz, B.E., and L.A. Treinish. 1993. An architecture for rule-based visualization. Proceedings of IEEE Visualization '93, 236-243

Spoerri, A. (1993). InfoCrystal: A Visual Tool for Information Retrieval. In Proceedings Visualization '93, San Jose, CA, 150-157.

Spink, A. 1997. Information science: a third feedback framework. Journal of the American Society for Information Science 48(8): 728-740.

Wehrend, S. 1992. Taxonomy of Visualization Goals (Appendix). In P.R. Keller and M.M. Keller (Eds.), Visual Cues, pp. 187-199

Wehrend, S. and C. Lewis. 1990. A problem-oriented classification of visualization techniques. Proceedings IEEE Visualization '90, 139-143

Wise, J.A., J.J. Thomas, K. Pennock, D. Lantrip, M. Pottier, A. Schur, and V. Crow. 1995. Visualizing the non-visual: spatial analysis and interaction with information from text documents. Proceedings of Information Visualization, October 20-21, 1995. IEEE Computer Society Press, Los Alamitos, CA. 51-58

Zhou, M.X. and S.K. Feiner. 1998. Visual task characterization for automated visual discourse synthesis. CHI ‘98, 392-399