
Conference Proceedings: 4th Conference on Human Factors and the Web


Human Centered Measures of Success in Web Site Design

J. Kirakowski
Human Factors Research Group, University College Cork, Ireland

N. Claridge and R. Whitehand
Nomos AB, Danderyd, Stockholm, Sweden.

Abstract

Incorporating HCI input into the development of web sites is assumed by the HCI community to be beneficial. But is it? A 60-item questionnaire for evaluating users' satisfaction with a web site was developed following a factor model used with success for conventional software evaluation. This questionnaire was shown to have high reliabilities, and a database of expected values was gathered for it. It was employed in a study evaluating five web sites in Northern Europe. The results indicate that visitors reported higher satisfaction for the web sites which had some explicit HCI involvement in their development. It is useful to be able to measure the success of HCI input to web site design, and the paper describes the subsequent development of this questionnaire.

Introduction

Although the web teems with helpful advice on how to design a usable web site, little work has been done so far on how to measure the results of this design activity in an objective manner. User-based subjective evaluation, understood as the measurement of user satisfaction, has been shown to contribute to the successful development and benchmarking of conventional software (see the SUMI web page for reviews and more details; URLs are given at the end of this paper). The authors have developed a similar questionnaire for the evaluation of user satisfaction with web sites.

Preparatory Work

Work on the questionnaire began in Autumn of 1996, after an unsuccessful first attempt to simply modify the SUMI questionnaire (used for evaluating desktop applications). The new questionnaire was code-named WAMMI (Web Analysis and MeasureMent Inventory). The following methodology was used:

  1. Opinions were sought from a large number of designers, users, and web masters about typical positive and negative experiences users encountered when visiting and using web sites.
  2. These statements were content analysed, and then fitted into the SUMI five-factor model of user-perceived satisfaction.
  3. The resulting item bank was blind-sorted by approximately 250 students and young professionals with varying degrees of web experience; elementary linkage analysis showed that the five-factor model was strongly upheld, and some questions with low group-cohesion scores were re-worded (a sketch of this clustering method is given after this list). The final version had five subscales, each corresponding to one usability factor (see below), with 12 items per subscale.
  4. A series of demographic questions was added to the front of the questionnaire, and space for freeform comments was also included at the end. Responses to items were sought on a seven-point scale. This became the Base 1 Version. You can see it on the WAMMI site.
  5. Base 1 Version was implemented as a web page at the Nomos service providers' site, and also produced as a paper-based questionnaire. Some client sites were in Sweden, so a Swedish version was produced using a rigorous back-translation methodology that the authors had adopted during the development of the SUMI questionnaire (SUMI now works in 12 equivalent language versions).
  6. 14 web sites eventually volunteered to be evaluated with the WAMMI questionnaire, producing over 300 user responses.
  7. Reliability estimates from this data bank were calculated, and population parameters (mean and standard deviation) were computed. There seemed to be no difference between the paper and electronic versions, nor between the Swedish and English language versions.
  8. Five of these 14 sites were chosen (largely on grounds of availability) for a follow-up study of the construct validity of WAMMI, which is reported in this paper.
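
For illustration, the elementary linkage analysis used in step 3 (McQuitty's method) can be sketched in a few lines of code. This is a minimal reading of the method under one common formulation, not the analysis scripts actually used; the agreement matrix and names are hypothetical.

    import numpy as np

    def elementary_linkage(agreement):
        # 'agreement' is a symmetric item-by-item similarity matrix
        # (e.g. how often two items were sorted into the same group);
        # the diagonal is ignored.
        sim = np.asarray(agreement, dtype=float).copy()
        np.fill_diagonal(sim, -np.inf)
        nearest = sim.argmax(axis=1)   # each item's strongest partner
        # Clusters are the connected groups formed by these links
        # (reciprocal pairs act as cluster nuclei).
        parent = list(range(len(nearest)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i, j in enumerate(nearest):
            parent[find(i)] = find(int(j))
        clusters = {}
        for i in range(len(nearest)):
            clusters.setdefault(find(i), []).append(i)
        return list(clusters.values())

With a clean five-factor structure, items from the same subscale should fall into the same cluster; items that drift between clusters are candidates for re-wording, as happened with the low-cohesion questions above.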

WAMMI Factors

The five WAMMI factors are Attractiveness, Control, Efficiency, Helpfulness, and Learnability, each summarising over the twelve statements of its subscale. Example statements from each subscale are given below:

Attractiveness

This web site is presented in an attractive way.

You can learn a lot on this web site.

Control

Going from one part to another is easy on this web site.

I feel in control when I'm using this web site.

Efficiency

You can find what you want on this web site right away.

This web site works exactly how I would expect it to work.

Helpfulness

This web site has not been designed to suit its users.

All the parts of this web site are clearly labelled.

Learnability

All the material is written in a way that is easy to understand.

It will be easy to forget how to use this web site.

Population Parameters for Version 1

As a result of analysing the data from all 14 sites, the population parameters are as given in Table 1.

Table 1: WAMMI Parameters, October '97 data

Scale Average StdDv Alpha
Attractiveness 4.79 1.21 .90
Control 4.92 0.80 .70
Efficiency 5.33 1.00 .83
Helpfulness 5.01 1.14 .89
Learnability 5.41 1.10 .86

Total Alpha = 0.96

Note that the averages all lie towards the upper end of the seven-point scale, but the standard deviations are small enough to leave usable measurement room between the averages and the top of the scale.

Overall, reliabilities for each of the five factors were satisfyingly high: from 0.70 to 0.90 as measured by Cronbach's Alpha coefficient (whose theoretical maximum is 1.00). The total questionnaire produced a high reliability coefficient of 0.96.
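
For readers who wish to replicate the reliability analysis, Cronbach's Alpha is straightforward to compute from the raw item responses. The sketch below is illustrative only; the function and variable names are ours, not those of the actual analysis.

    import numpy as np

    def cronbach_alpha(items):
        # 'items' is a respondents-by-items matrix of ratings (1..7)
        # for one subscale, e.g. the 12 Attractiveness items.
        items = np.asarray(items, dtype=float)
        k = items.shape[1]                         # number of items
        item_vars = items.var(axis=0, ddof=1)      # per-item variances
        total_var = items.sum(axis=1).var(ddof=1)  # variance of sum scores
        return (k / (k - 1.0)) * (1.0 - item_vars.sum() / total_var)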

Using the above information, it is possible to convert the scale scores for any one web site to a normalised distribution whose average is artificially set at 50, with a standard deviation of 10. This is the numerical format in which the scale scores are reported in the remainder of this paper.
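
A minimal sketch of this conversion, using the Table 1 parameters (the function name is ours):

    def standardise(raw_average, pop_mean, pop_sd):
        # Convert a raw subscale average (on the 1..7 response scale)
        # to the standardised scale with mean 50 and s.d. 10.
        return 50.0 + 10.0 * (raw_average - pop_mean) / pop_sd

    # Example: a raw Attractiveness average of 5.03 against Table 1
    # gives standardise(5.03, 4.79, 1.21), i.e. about 52.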

Scoring Reports

The data for each site were scored using a standard spreadsheet format. The scoring procedure essentially gives two kinds of information for each site:

  1. Scale Analysis: whether the site being evaluated is above or below the expected (combined data) average of 50 (with a standard deviation of 10);
  2. Item Analysis: for each question, whether the response pattern of users shows that the aspect of satisfaction tapped by the question is better or worse than the overall pattern for that question. Items which are exceptionally better or worse than the overall pattern may be taken together to form a composite overview of the good and poor features of the site being analysed (the statistical criterion used was the 0.05 critical region of the chi-square distribution).

Thus (1) answers questions relating primarily to summative evaluation, while (2) answers questions relating more to formative evaluation.
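
As an illustration of the item-level criterion in (2), the test below compares one site's response counts on a single item against the pattern expected from the whole database. The counts and proportions shown are hypothetical, and this is our reading of the criterion rather than the actual scoring spreadsheet.

    from scipy.stats import chisquare

    site_counts = [1, 0, 2, 4, 8, 7, 3]   # one item, responses 1..7
    overall_props = [.02, .03, .08, .15, .30, .27, .15]  # database pattern

    n = sum(site_counts)
    expected = [p * n for p in overall_props]
    stat, p_value = chisquare(site_counts, f_exp=expected)
    flagged = p_value < 0.05   # exceptionally better or worse than usual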

In addition to the actual questionnaire, users were asked a number of questions about their demographic status, the frequency with which they visited the site, their experience of using the web, and so on. These additional questions were used to break down the overall user satisfaction profile, to get a closer look at what different categories of users thought about the sites. These results are only commented on where there is a significant difference between the cross-tabulations of WAMMI scales by levels of demographic variables (as before, the 0.05 critical region of the chi-square distribution was used).
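
The demographic breakdowns were tested in the same spirit; the sketch below (with hypothetical column names and data) cross-tabulates one WAMMI scale, banded into levels, against web experience.

    import pandas as pd
    from scipy.stats import chi2_contingency

    # One row per respondent; both columns are hypothetical examples.
    responses = pd.DataFrame({
        "web_experience":  ["light", "heavy", "heavy",
                            "light", "heavy", "light"],
        "efficiency_band": ["high", "low", "low",
                            "high", "average", "high"],
    })
    table = pd.crosstab(responses["web_experience"],
                        responses["efficiency_band"])
    stat, p_value, dof, expected = chi2_contingency(table)
    comment_on = p_value < 0.05   # only significant breakdowns reported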

Sites Evaluated and Results Obtained

The five sites whose results are reported here were small to medium-sized "information" sites with relatively low traffic (typically between 10 and 100 visitors per day). These sites put up a 'WAMMI Button' on their front page and asked users to click on it once they had explored the site enough to make an informed assessment of it. Clicking the button loaded the WAMMI questionnaire. There was no control exercised over who could contribute, although with a 60-item questionnaire on the web, users would have to be fairly dedicated to provide all the required data. Informal statistics indicate that between 1% and 5% of visitors (as calculated from hit rates over the time period of the WAMMI button) contributed. Partially completed or spurious data sets (runs of 12 or more ratings at the same level) were ignored.
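
The screening rule can be expressed as a small filter. The sketch below is our reading of the rule ('runs' taken to mean consecutive identical ratings), not the script actually used.

    def is_spurious(ratings, n_items=60, max_run=12):
        # Reject partially completed response sets, and sets containing
        # a run of 12 or more identical consecutive ratings.
        if len(ratings) < n_items:
            return True
        run = 1
        for prev, cur in zip(ratings, ratings[1:]):
            run = run + 1 if cur == prev else 1
            if run >= max_run:
                return True
        return False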

Site 1

This was the web site of a small Swedish consultancy firm, and was a small to medium-sized site. The home page set out six main categories of content, and there was almost no cross-linking between sections. All content was reachable in three steps or fewer from the home page. The site was designed to be viewable in a variety of browsers and screen resolutions. The site was in English only, despite a significant Swedish audience. It was developed with input from personnel experienced in human factors for the web.

Table 2: Summary Statistics for Site 1

Scale: Attract Control Effic Helpf Learnab
Average 51.88 54.09 52.51 52.59 54.27
Median 51.05 53.05 54.21 55.76 56.84
St Dev 6.49 7.69 8.44 9.14 7.13

N = 25

Most users are quite positive and unanimous in their opinions about the site (low standard deviations, good match between Averages and Medians). Control and Learnability are the best dimensions of the site, but all the usability dimensions are well above the general standard of 50.

Some users may have problems with Efficiency and Helpfulness.

Item analysis suggests this is an enjoyable and comprehensible web site, with no 'hidden features' that make life difficult for the user, and one which keeps users well informed about where they are in the navigation. Some doubt is expressed as to whether the site is updated as regularly as it should be, and the icons and graphics are not appreciated as well as the designers may have thought.

Site 2

This was the web site of a European 'support' project, aimed at supporting other European IT projects and companies with information about usability tools and methods. It was a medium-sized site, containing a lot of information about various methods, including some quite long documents. The design used frames, presenting an expanding tree-like hierarchical structure on the left side of the browser window. Navigation was possible both through this left frame and the larger content frame on the right, and most content could be reached within a few steps. It was developed with significant human factors input from part of the design team.

Table 3: Summary Statistics for Site 2

Scale: Attract Control Effic Helpf Learnab
Average 55.02 53.83 52.41 53.99 52.56
Median 53.81 55.14 52.96 53.20 54.58
St Dev 5.96 8.73 5.11 5.70 7.05

N = 12

There is strong consensus among the users that all the dimensions of satisfaction are well above average. Although Efficiency does not score as highly as some of the other dimensions, opinions about the site's Efficiency are very consistent across users. Opinions about the Controllability of the site appear to differ (wider spread of scores and greater difference between Average and Median). Users find the site Attractive, and also find it Helpful.

Surprisingly, really heavy web users (data from demographic cross-tabulations, not shown here) find this site works well for them. They are a hard target to satisfy. It would seem that the more you use the site, the more it grows on you.

Item analysis: since this is a page of resources, it's a good thing that users find the links useful, that they think one can learn a lot from it, and that downloading is easy. However, it appears to be more complex than it needs to be, and users may be under the impression that there are 'hidden corners' on the site which cannot be accessed: there seems to be a small problem with remembering how the site is meant to work. Perhaps more attention to the labelling of pages on the site, and some more introductory material, would be a good idea.

There may be some technical difficulties: the site is seen as changing unpredictably, stopping and starting without warning, and providing the user’s computer with information that cannot be displayed, although in fact users think that all the information is actually accessible.

Site 3

This was the web site of a medium-sized Swedish consultancy firm. It was a very small site, mainly consisting of several pages giving an overview of the company and its latest news. The site required horizontal scrolling at lower screen resolutions. Graphics and text were in some cases rather large, and there was significant use of horizontal lines on some pages. As the site was fairly small, navigation was simple. There was no human factors involvement in the development of this site.

Table 4: Summary Statistics for Site 3

Scale: Attract Control Effic Helpf Learnab
Average 44.27 49.49 39.39 47.41 49.29
Median 51.05 48.87 44.22 45.50 51.55
St Dev 11.90 5.04 13.66 7.34 5.50

N = 5

The median is usually higher than the average; this is especially noticeable on Attractiveness and Efficiency. This is because some users rated the site extremely low on these two dimensions, as we were able to see from the individual user score profiles (not shown).

The site is at the general average rating for Control and Learnability. However, it is significantly below on Attractiveness and Efficiency. There seems to be some difference between visitors 1, 3, and 5, who have a good impression overall, and visitors 2 and 4, especially with regard to Efficiency, which may be bringing the scores on these dimensions down. On a technical note it is good to see that the standard deviations are fairly tight on at least three of the dimensions (those with high reliability estimates) given the small sample size.

Item analysis suggests that the positive aspects of the web site are that it is easy to navigate around it, and that the links are understandable. These sorts of issues would tend to raise its Efficiency and Control. However, there appears to be a lot of ‘clutter’ on the web site caused by extraneous graphics and pictures which are confusing and possibly misleading, the language used is not very comprehensible, and the pages on the site do not have clear titles. There may also be an element of the site being too advanced for the computing capabilities of some of its intended users.

Site 4

This was the web site of a large consultancy company. It included basic information about the company, its services, press releases and partners, and had steadily grown from a small to a somewhat larger site without much change to structure or navigation. Navigation was achieved purely through a column on the left side of pages (not a frame), but returning to previous levels required use of the Back function of the browser (there was no support for return through the navigation levels other than a link directly back to the home page). Navigation was not well structured and was poorly labelled. Some pages displayed poorly on low resolution screens, requiring horizontal scrolling. There was no significant human factors input to the development of this site.

Table 5: Summary Statistics for Site 4, Broken Down by External vs Internal Users

    Attract Contr Effic Helpf Learn
External mean 41.96 49.34 49.26 44.14 48.62
  sd 9.16 10.25 8.41 11.14 10.57
Internal mean 46.10 50.43 49.89 48.57 50.42
  sd 11.56 8.15 8.41 10.12 9.29
Overall mean 43.94 49.86 49.56 46.26 49.48
  sd 10.51 9.25 8.35 10.81 9.94

Total N = 67

The site is on the general average for Controllability, Efficiency, and Learnability, and slightly below for Attractiveness and Helpfulness. However, there is a big difference between the highest and the lowest ratings on each dimension. For the kinds of numbers we have sampled, the standard deviations are quite large, indicating a possibly heterogeneous group of users.

The data can be divided according to whether users came from inside or outside the organisation. The amount of variation in the responses is still quite large, even when the origin of the users is taken into consideration. Internal users find the site more Attractive and more Helpful than external users do. Apart from that, there seems to be little difference between these two groups of users.

Item analysis indicates that users find the site works fast enough for their needs, but they do not find the web site attractive or enjoyable, and they think they will get bored with it very quickly and not want to come back. The main problem seems to be navigation, which comes up in several of the items, and this in turn may be leading to problems with picking up the informational content of the site, although users think they can find all the information they are looking for, anyway.

Users who have a lot of experience of other web sites are more critical of this site than users who have little experience of other web sites; and users who have seen this site for the first time or do not visit it often are less critical of it (cross-tabulations not shown).

Site 5

This was the web site of a small research and consultancy group at a university. It was a small to medium-sized site with a significant amount of information on a range of topics to do with usability engineering, well grouped, in 7 main sections. Some pages, whilst intended to be read on-screen, were probably too long for that purpose. Navigation varied (especially for returning up levels) and was unclear on some pages. Presentation format varied between some sections. The site was not developed with any significant human factors input, although the site owners were human factors experts. It appears they handed over the site development to an outside consultant.

Table 6: Summary Statistics for Site 5

Scale: Attract Control Effic Helpf Learnab
Average 47.59 50.37 52.64 51.04 51.64
Median 49.66 45.73 54.21 51.36 53.82
St Dev 9.63 13.97 7.26 8.09 9.41

N = 9

All the scales, with the exception of Attractiveness, are on or just above the general average mark, and Efficiency is the site's strongest dimension, with Learnability close behind. There is quite a large spread on the Controllability dimension, showing a diversity of opinion: some users rate the site very highly on Controllability, others much lower. On a technical point, it is noteworthy that even with a small N the standard deviations are mostly smaller than the population parameter of 10.00.

Item analysis indicates that the links are useful and helpfully labelled, although some users feel that navigation within the site could be improved and that the site is not internally consistent. Its icons and graphics are neither very pleasant to look at nor helpful. Most worrying is that copying information from the site is not easy: for an information provider's site, this could be very damaging.

Discussion

Table 7: Human Factors Sites vs the Rest

Scale: Attract Control Effic Helpf Learnab
Sites 1 & 2 52.90 54.01 52.48 53.04 53.72
Sites 3, 4 & 5 44.37 49.89 49.27 46.86 49.71

Table 7 shows the weighted averages for the two groups of sites. As we can see, sites 1 and 2 show above-average profiles for satisfaction, and these are the sites which had human factors input. Sites 3, 4 and 5 show average or below-average profiles for user satisfaction, and these are the sites which did not have human factors input to their design. The group averages are approximately half a population standard deviation apart which, given the size of the combined N, yields differences that are statistically significant at least at the 0.05 level (a sketch of this computation follows).
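
Table 7 can be reproduced directly from Tables 2 to 6; the sketch below does so for Attractiveness, together with an approximate significance check. The two-sample z-test against the population s.d. of 10 is our reading of the criterion, not necessarily the exact test used.

    from math import sqrt

    hf_sites   = [(51.88, 25), (55.02, 12)]             # sites 1 and 2
    rest_sites = [(44.27, 5), (43.94, 67), (47.59, 9)]  # sites 3, 4, 5

    def weighted_mean(pairs):
        return sum(m * n for m, n in pairs) / sum(n for _, n in pairs)

    m1, n1 = weighted_mean(hf_sites), sum(n for _, n in hf_sites)
    m2, n2 = weighted_mean(rest_sites), sum(n for _, n in rest_sites)
    # m1 is about 52.90 and m2 about 44.37, as in Table 7.

    z = (m1 - m2) / (10.0 * sqrt(1.0 / n1 + 1.0 / n2))
    # z is about 4.3 for Attractiveness, well beyond the 0.05 criterion.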

The results of this experiment indicate that the web usability questionnaire, at least in its 60-item version, is not only a reliable measure but also displays some concurrent validity: the web sites we would expect to have higher user satisfaction scores, because they were developed with significant human factors input, do indeed show this kind of profile.

Of course, the users are self-selected and there is a large amount of wastage, i.e., partial records from users who 'gave up' part-way through; given the numbers of users who visited the sites, we have sampled less than 5% of the total number of visitors at each site. Nor was there any control against malicious attacks or other more benign sources of interference. However, all these influences should operate evenly over the five sites evaluated, and so constitute the background 'noise' to the evaluation. There is strong evidence that filling out a 60-item questionnaire on the web is not a task that the average user can be expected to carry out effortlessly, or even willingly.

Future Directions

There are two directions that we have followed since obtaining the results reported here. One direction has been to develop a much shorter, faster questionnaire, based on the present results, that can be delivered effectively through the web, and to which users can respond with greater ease.

Recent data on the short questionnaire show extremely high reliabilities, certainly comparable to those cited in Table 1. This is in itself an achievement, since reliability estimates are adversely affected (for technical statistical reasons) by short questionnaires. We currently have data from 25 sites for the new short version of WAMMI, and approximately 1,500 user responses. WAMMI is now in a position to be offered commercially as a reliable method for evaluating user responses to web sites.
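
The length effect referred to is conventionally described by the Spearman-Brown formula, which makes clear why high reliabilities on a short form are an achievement. The brief illustration below is our own, not an analysis from this study.

    def spearman_brown(reliability, k):
        # Predicted reliability when a test is lengthened (k > 1) or
        # shortened (k < 1) by a factor k, other things being equal.
        return (k * reliability) / (1.0 + (k - 1.0) * reliability)

    # Example: shortening a 60-item form with alpha 0.96 to 20 items
    # (k = 20/60) predicts spearman_brown(0.96, 20/60), about 0.89.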

The other direction is to fortify the 60-item version by replacing and re-wording the few items which did not discriminate well between different kinds of web sites, and to develop this questionnaire as a tool for use in laboratory or otherwise controlled circumstances. Although this direction of development has been relegated to second place during the first two quarters of 1998, we expect to be able to offer a reliable commercial tool from it before the end of the year.

Conclusions

To be positive about the results, however: we have shown that user satisfaction with web sites can be measured reliably with a questionnaire of this kind, and that the resulting scale scores can be standardised against a reference database of expected values.

In addition, this study demonstrates that web sites which are developed using human factors input do actually produce higher user satisfaction levels than sites which, however well crafted technically, have not benefited from this kind of input. As a community, we tend to hold this belief as dogma. The result of the study reported here is not only that this dogma is plausible, but that it is true.

In the end, web site owners will increasingly require proof that the design effort they have paid for has some functional benefit for them. We would argue that subjective user-based testing is a 'must' in a competitive environment like designing for the World Wide Web.

Acknowledgements

The authors gratefully acknowledge the constructive criticism and advice of Mr Jonathan Earthy of Lloyd's Register (UK), who was the first to point out to us the practical significance of the data we had collected. We acknowledge the contribution of the European Commission's BASELINE project in Information Engineering for help in gathering the much-needed standardisation data. We are also grateful to the anonymous reviewers of the first draft of this paper, who encouraged us to go that little bit further.

URLs

The SUMI homepage is found at:

http://www.ucc.ie/hfrg/questionnaires/sumi

The WAMMI information page is found at:

http://www.ucc.ie/hfrg/questionnaires/wammi

The Nomos homepage is found at:

http://www.nomos.se

URLs to examples of WAMMI questionnaires can be found at the WAMMI information page or at the Nomos homepage.