Weak inter-rater reliability in heuristic evaluation of video games

Heuristic evaluation promises to be a low-cost usability evaluation method, but it is fraught with problems of subjective interpretation and a proliferation of competing, contradictory heuristic lists. This is particularly true in games research, where no rigorous comparative validation has yet been published. To validate the available heuristics, a user test of a commercial game is conducted with 6 participants, identifying 88 usability issues, against which 146 heuristics are rated for relevance by 3 evaluators. Inter-rater reliability is weak (Krippendorff's alpha = 0.343), which prevents validation of any of the available heuristics. This weak reliability is attributed to the high complexity of video games: evaluators infer different, yet individually reasonable, causes and solutions for each issue, leading to wide variance in their ratings of the heuristics.
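For context, Krippendorff's alpha is defined as α = 1 − D_o/D_e, the observed disagreement among coders divided by the disagreement expected by chance; values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance, which is why 0.343 counts as weak. The following is a minimal sketch of the computation for nominal data; the study's relevance ratings may instead call for the ordinal difference function, and the ratings shown are hypothetical.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha with the nominal difference function.

    `units` is a list of items, each holding the values the coders
    assigned to that item (None marks a missing rating). Items with
    fewer than two codings are skipped, per Krippendorff's definition.
    """
    # Coincidence matrix: every ordered pair of values within an item
    # contributes 1 / (m - 1), where m is the item's coding count.
    o = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue
        for i, j in permutations(range(m), 2):
            o[(values[i], values[j])] += 1.0 / (m - 1)

    n_c = Counter()  # marginal totals per value
    for (c, _k), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    if n <= 1:
        raise ValueError("need at least one item with two codings")

    # Nominal disagreement: any pair of unequal values counts as 1.
    d_observed = sum(w for (c, k), w in o.items() if c != k)
    d_expected = (n * n - sum(v * v for v in n_c.values())) / (n - 1)
    return 1.0 if d_expected == 0 else 1.0 - d_observed / d_expected

# Hypothetical relevance ratings by three evaluators for five
# heuristics on a 1-3 scale; None is a missing judgement.
ratings = [
    [1, 1, 1],
    [3, 3, 2],
    [2, 2, None],
    [1, 3, 2],
    [2, 2, 2],
]
print(krippendorff_alpha_nominal(ratings))
```

Alpha handles missing ratings and any number of coders uniformly, which is what makes it preferable to pairwise kappa statistics in a three-evaluator design like the one described above.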
