The World of Research Has Gone Berserk: Modeling the Consequences of Requiring “Greater Statistical Stringency” for Scientific Publication

ABSTRACT In response to growing concern about the reliability and reproducibility of published science, researchers have proposed adopting measures of “greater statistical stringency,” including suggestions to require larger sample sizes and to lower the highly criticized “p < 0.05” significance threshold. While pros and cons are vigorously debated, there has been little to no modeling of how adopting these measures might affect what type of science is published. In this article, we develop a novel optimality model that, given current incentives to publish, predicts a researcher’s most rational use of resources in terms of the number of studies to undertake, the statistical power to devote to each study, and the desirable prestudy odds to pursue. We then develop a methodology that allows one to estimate the reliability of published research by considering a distribution of preferred research strategies. Using this approach, we investigate the merits of adopting measures of “greater statistical stringency” with the goal of informing the ongoing debate.
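
The optimality model described in the abstract lends itself to a small numeric illustration. The sketch below is a simplified, assumption-laden toy, not the authors' actual model: it supposes a researcher with a fixed participant budget who is rewarded for "significant" results, and shows the trade-off between running many low-powered studies and a few well-powered ones, along with how that choice feeds into the reliability (positive predictive value) of what gets published. The effect size, significance threshold, budget, prior odds, and payoff function are all hypothetical choices made for illustration.

```python
# A minimal, hypothetical sketch of the kind of resource-allocation problem the
# abstract describes: a researcher splits a fixed sample budget across k studies,
# which determines per-study power, the expected number of "positive"
# (publishable) findings, and the reliability (PPV) of those findings.
# All parameter values below are illustrative assumptions.

from scipy.stats import norm

ALPHA = 0.05    # significance threshold (assumed)
EFFECT = 0.3    # assumed standardized effect size (Cohen's d)
BUDGET = 1000   # total participants per arm available across all studies (assumed)
PRIOR = 0.2     # assumed pre-study probability that a tested hypothesis is true

def power(n_per_group, d=EFFECT, alpha=ALPHA):
    """Approximate power of a two-sided, two-sample z-test with n per group."""
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5   # noncentrality for equal group sizes
    return 1 - norm.cdf(z_crit - ncp)

def expected_positives(k):
    """Expected number of significant results from k equally sized studies."""
    n = BUDGET / k
    pw = power(n)
    return k * (PRIOR * pw + (1 - PRIOR) * ALPHA), pw

def ppv(pw, prior=PRIOR, alpha=ALPHA):
    """Reliability of a published positive result (positive predictive value)."""
    return prior * pw / (prior * pw + (1 - prior) * alpha)

if __name__ == "__main__":
    for k in (1, 2, 5, 10, 20, 50):
        positives, pw = expected_positives(k)
        print(f"{k:3d} studies | power = {pw:5.2f} | "
              f"expected positives = {positives:5.2f} | PPV = {ppv(pw):4.2f}")
```

Under these illustrative assumptions, spreading the budget over many small studies raises the expected count of significant results while lowering the positive predictive value of each one; this tension between individual publication incentives and the reliability of the published literature is what the full model formalizes.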
