How Effective Are Heuristic Evaluations?
The idea behind discount usability methods, heuristic evaluations in particular and expert reviews in general, is that it’s better to uncover some usability issues than none, even if you don’t have the time or budget to test actual users.
That’s because despite the rise of cheaper and faster unmoderated usability testing methods, it still takes a considerable amount of effort to conduct a usability test.
If a few experts can inspect an interface and uncover many or most of the problems users would encounter in less time and for less cost, then why not exploit this method?
But, can we trust heuristic evaluations? How do we know the problems evaluators uncover aren’t just opinions?
Do they uncover legitimate problems with an interface? How many problems are missed? How many problems are false positives?
Heuristic Evaluation and Usability Testing
To help answer these questions, we conducted a heuristic evaluation and usability test to see how the different evaluation methods compared.
We recently reported on a heuristic evaluation of the Budget and Enterprise websites. Four inspectors (2 experts and 2 novices) independently examined each website for issues users might encounter. They were asked to limit their inspections to two tasks (finding a rental location and renting a car).
In total, 22 issues were identified across the four evaluators. How many of these issues would users encounter and what was missed?
Prior to the heuristic evaluation, we conducted a usability test on the same websites but didn’t share the results with the evaluators.
In total, we had 50 users attempt the same two tasks on both websites. The test was an unmoderated study conducted using UserZoom. Each participant was recorded using Usertesting.com so we could play back all sessions with audio and video to identify usability issues. Two researchers viewed all 50 videos to record usability problems and identified 50 unique issues.
The graph below shows the 22 issues identified by the evaluators and the number and percent of users that encountered the issue.
Figure 1: Problem matrix for Budget.com (“B”) and Enterprise.com (“E”) from four evaluators (E1-E4) and the number and percentage of 50 users who encountered the issue in a usability test.
For example, three evaluators and 24 of the 50 users (48%) on Enterprise had trouble locating the place where rental locations were listed (issue #1 “Locations Placement”).
Two evaluators and 14 users (28%) had a problem with the way the calendar switches from the current month to the next month on Budget.com (issue #16), as shown in the figure below.
Figure 2: When selecting certain return dates, the Budget calendar will switch the placement of the month (notice how October goes from being on the right to the left).
All four evaluators found that adding a GPS to your rental after you added your personal information was confusing on Enterprise.com—an issue 62% of users also had (issue #11).
We found that the evaluators identified 16 of the 50 issues users encountered in the usability test (32% of the total). In terms of false positives, only two of the issues identified by the evaluators weren’t found by any of the 50 users (9%).
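The three rates discussed in this article can be recomputed from the counts reported above. The snippet below is a minimal sketch, not the analysis script used in the study; the variable names are illustrative, and the miss count is assumed to be the 50 usability-test issues minus the 16 the evaluators also identified.

```python
# Counts as reported in the text (variable names are illustrative).
HE_ISSUES = 22               # issues reported by the four evaluators
UT_ISSUES = 50               # unique issues observed across the 50 users
UT_ISSUES_FOUND_BY_HE = 16   # usability-test issues the evaluators also identified
HE_ISSUES_NOT_SEEN = 2       # evaluator issues that no user encountered


def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Hits and misses are expressed against the usability-test issues,
# false alarms against the evaluators' issues, so the denominators differ.
hit_rate = pct(UT_ISSUES_FOUND_BY_HE, UT_ISSUES)
miss_rate = pct(UT_ISSUES - UT_ISSUES_FOUND_BY_HE, UT_ISSUES)
false_alarm_rate = pct(HE_ISSUES_NOT_SEEN, HE_ISSUES)

print(f"hits: {hit_rate}%  misses: {miss_rate}%  false alarms: {false_alarm_rate}%")
```

Because the false alarm rate uses a different denominator than the hit and miss rates, the three percentages are not expected to sum to 100%.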
How this study compares
There is a rich publication history comparing heuristic evaluations and usability testing. In fact, two of the most influential papers in usability cover usability testing and heuristic evaluations. In examining some of the more recent publications comparing HE and UT we looked for specific examples like our experiment, where the overlap in problems between the two methods is shown.
The table below shows that, across four studies plus the current one, heuristic evaluations find on average around 36% of the problems found in usability tests (ranging from 30% to 43%).
|Study||Hits (in both HE and UT)||In HE not UT (False Alarms)||In UT not HE (Misses)||Inspectors||Users|
|Doubleday et al.||36%||40%||39%||5||20|
|Law & Hvannberg 2002||30%||38%||32%||2||10|
|Law & Hvannberg 2004||43%||46%||48%||18||19|
|Hvannberg et al. 2006||40%||37%||60%||10||10|
|Current study||32%||9%||68%||4||50|
The overlap is called a “hit,” meaning the discounted method hit on the same issue as found in the traditional evaluation method of usability testing.
To get an idea about potential false alarms, we see that on average, 34% of problems identified in Heuristic Evaluations aren’t found by users in a usability test (ranging from 9% to 46%). These have come to be known as “False Alarms,” suggesting these problems would not be encountered by users.
Finally, we see that on average, heuristic evaluations miss around 49% of the issues uncovered from watching users in a usability test. Note: The percentages don’t always add up to 100% because different problem numbers are used to derive the percentages.
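The averages and ranges quoted above can be checked with a few lines of code. This is a sketch that assumes the averages are simple unweighted means over the five studies; the four published studies use the table values, and the current study uses the 32%, 9%, and 68% figures reported in this article.

```python
from statistics import mean

# (hits %, false alarms %, misses %) per study.
studies = {
    "Doubleday et al.": (36, 40, 39),
    "Law & Hvannberg 2002": (30, 38, 32),
    "Law & Hvannberg 2004": (43, 46, 48),
    "Hvannberg et al. 2006": (40, 37, 60),
    "Current study": (32, 9, 68),
}

# Transpose the rows into one tuple per measure.
hits, false_alarms, misses = zip(*studies.values())

avg_hits = mean(hits)
avg_false_alarms = mean(false_alarms)
avg_misses = mean(misses)

print(f"hits: {avg_hits:.1f}% (range {min(hits)}-{max(hits)}%)")
print(f"false alarms: {avg_false_alarms:.1f}% (range {min(false_alarms)}-{max(false_alarms)}%)")
print(f"misses: {avg_misses:.1f}% (range {min(misses)}-{max(misses)}%)")
```

An unweighted mean treats a 10-user study the same as a 50-user study; weighting by sample size would be a different, equally defensible choice.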
This study had by far the most users (2.5 times as many as any other). This likely explains the much lower false alarm rate (9% vs. 34% average) and higher miss rate (68% vs. 49% average). With more users, you increase the chances of seeing new issues and you increase the chances of “hitting” the issues identified by the evaluators.
The lower false alarm rate might also be explained by our task-based inspection approach. Often inspectors aren’t confined to specific tasks when evaluating an interface and detect problems in the less used parts, which users often don’t encounter in a 30-60 minute study.
This exercise also illustrates the shortcoming of this approach for judging the effectiveness of heuristic evaluations. Just because an issue wasn’t detected in a usability test doesn’t mean it won’t be encountered by a user. In fact, one of the advantages of an expert review is that it uncovers issues that are harder to find in usability tests because users rarely visit enough parts of a website or software application outside of the assigned task(s).
What’s more, even 50 users represent less than 1% of the daily number of users on these websites, meaning it’s presumptuous to assume that no users would have the issue. If we don’t see an issue with 50 users, we can be 95% confident that between 0% and 6% of all users still might encounter it. For example, the two issues found in the heuristic evaluation and not found in our usability test both seem like legitimate issues; with enough users, we’d probably eventually see them.
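The article doesn’t state which interval method produced the 0% to 6% range. As a hedged illustration, two common approaches for the case of zero observed occurrences land close to that figure: the exact one-sided binomial upper bound and the “rule of three” approximation. The function names below are illustrative.

```python
def upper_bound_zero_events(n, confidence=0.95):
    """Exact one-sided upper bound on the true rate when 0 of n users hit an issue.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    alpha = 1 - confidence
    return 1 - alpha ** (1 / n)


def rule_of_three(n):
    """Quick approximation of the 95% upper bound for 0 of n: roughly 3 / n."""
    return 3 / n


exact = upper_bound_zero_events(50)
approx = rule_of_three(50)

print(f"exact upper bound: {exact:.1%}")
print(f"rule of three:     {approx:.1%}")
```

Both values round to about 6% for 50 users. Other interval methods, such as a two-sided exact interval, give a somewhat higher upper bound, so the figure depends on the method chosen.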
The most effective approach at uncovering usability problems is to combine both heuristic evaluations and usability testing.
Heuristic evaluations will typically find between 30% and 50% of problems found in a concurrent usability test—a finding also echoed by Law and Hvannberg.
It’s hard to conclude that issues identified in a heuristic evaluation and not in a usability test are “false positives.” It could be that the issues are encountered by fewer users and just weren’t detected with the sample tested. It’s probably better to characterize them as less frequent or “long-tail” issues than false positives.
So how effective are heuristic evaluations? While the question will and should continue to be debated and researched, I like to think of heuristic evaluations like sugary cereal. They provide a quick jolt of insight but should be part of a “balanced” breakfast of usability evaluation methods.