When Null Hypothesis Significance Testing Is Unsuitable for Research: A Reassessment

Null hypothesis significance testing (NHST) has several shortcomings that are likely contributing factors to the widely debated replication crisis in psychology, cognitive neuroscience, and biomedical science in general. We review these shortcomings and suggest that, after about 60 years of negative experience, NHST should no longer be the default, dominant statistical practice of all biomedical and psychological research. Different inferential methods (NHST, likelihood estimation, Bayesian methods, false-discovery rate control) may be most suitable for different types of research questions. Whenever researchers use NHST, they should justify its use and publish pre-study power calculations and effect sizes, including negative findings. Ideally, studies should be pre-registered and their raw data published. The current "statistics-lite" educational approach for students, which has sustained the widespread, spurious use of NHST, should be phased out. Instead, we should encourage more in-depth statistical training for more researchers, more widespread involvement of professional statisticians in all research, or both.
