HCI Statistics without p-values

Statistics are tools that help end users (here, researchers) accomplish their task (advancing scientific knowledge). Science is a collective and cumulative enterprise, so to qualify as usable, statistical tools should support and promote clear thinking as well as clear and truthful communication. Yet areas such as human-computer interaction (HCI) have adopted tools, namely p-values and statistical significance testing, that have proven quite poor at supporting these tasks. The use and misuse of p-values and significance testing have been severely criticized across a range of disciplines for several decades, suggesting that the tools, not their end users, are to blame. This article explains why it would be beneficial for HCI to switch from statistical significance testing to estimation, that is, reporting informative charts with effect sizes and confidence intervals and offering nuanced interpretations of our results. Advice is offered on how to communicate our empirical results in a clear, accurate, and transparent way without using any p-value.
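To make the estimation approach concrete, the sketch below (not from the article itself) shows one common way to report an effect size with an interval estimate instead of a p-value: a difference in mean task times with a 95% percentile bootstrap confidence interval. The data, condition names, and the choice of the simple percentile bootstrap are all illustrative assumptions; other interval methods and tools exist.

```python
# A minimal sketch of "estimation" reporting: an effect size (mean difference)
# with a 95% bootstrap confidence interval, and no p-value.
# All data and names below are hypothetical.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical task completion times (seconds) under two interface conditions.
baseline  = np.array([12.1, 10.4, 11.8, 13.0, 9.7, 12.5, 11.2, 10.9])
technique = np.array([ 9.8,  9.1, 10.5, 11.2, 8.9, 10.1,  9.6,  9.9])

# Point estimate of the effect: difference in mean completion times.
observed_diff = technique.mean() - baseline.mean()

# Percentile bootstrap: resample each group with replacement and
# recompute the difference in means many times.
boot_diffs = np.array([
    rng.choice(technique, technique.size, replace=True).mean()
    - rng.choice(baseline, baseline.size, replace=True).mean()
    for _ in range(10_000)
])
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])

print(f"Mean difference: {observed_diff:.2f} s, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

In a report, such an estimate would typically be shown as a dot with an error bar in a chart, and interpreted in nuanced terms (plausible range of effect sizes) rather than as a binary significant/non-significant verdict.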
