A new standard for the analysis and design of replication studies

A new standard is proposed for the evidential assessment of replication studies. The approach combines a specific reverse-Bayes technique with prior-predictive tail probabilities to define replication success. The method gives rise to a quantitative measure of replication success, called the sceptical p-value. The sceptical p-value combines the traditional statistical significance of both the original and replication studies with a comparison of the respective effect sizes. It incorporates the uncertainty of both the original and replication effect estimates and reduces to the ordinary p-value of the replication study if the uncertainty of the original effect estimate is ignored. The proposed framework can also be used to determine the power or the required sample size needed to achieve replication success. Numerical calculations highlight the difficulty of achieving replication success when the evidence from the original study is only suggestive. An application to data from the Open Science Collaboration project on the replicability of psychological science illustrates the proposed methodology.
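
The abstract does not spell the construction out, so the following is a minimal sketch of how the sceptical p-value can be computed, reconstructed from the description above: a reverse-Bayes step finds the zero-mean normal sceptical prior under which the original result would just lose significance at level alpha, and a prior-predictive (Box-type) tail probability then checks the replication estimate against that prior; the sceptical p-value is the smallest level at which this check succeeds. The function name sceptical_p_value, the two-sided convention, and the closed-form root below are illustrative assumptions, not the authors' code.

    # Minimal sketch (Python), not the authors' implementation.
    # Inputs: effect estimates and standard errors from both studies.
    from math import sqrt
    from scipy.stats import norm

    def sceptical_p_value(theta_o, se_o, theta_r, se_r):
        """Two-sided sceptical p-value (illustrative reconstruction)."""
        z_o2 = (theta_o / se_o) ** 2      # squared z-value, original study
        z_r2 = (theta_r / se_r) ** 2      # squared z-value, replication study
        c = se_o**2 / se_r**2             # variance ratio of the two estimates

        # Reverse-Bayes: the sceptical prior variance at level alpha is
        # tau^2 = se_o^2 / (z_o^2/z_alpha^2 - 1). Box's prior-predictive tail
        # probability for the replication estimate under that prior equals
        # alpha exactly when
        #     (z_o^2/x - 1) * (z_r^2/x - 1) = c,   with x = z_S^2,
        # i.e. (1 - c) x^2 - (z_o2 + z_r2) x + z_o2 * z_r2 = 0.
        b = z_o2 + z_r2
        if abs(1 - c) < 1e-12:
            x = z_o2 * z_r2 / b           # c = 1: the quadratic degenerates
        else:
            disc = b**2 - 4 * (1 - c) * z_o2 * z_r2
            # this root satisfies 0 <= x <= min(z_o2, z_r2) for c < 1 and c > 1
            x = (b - sqrt(disc)) / (2 * (1 - c))

        return 2 * norm.sf(sqrt(x))       # sceptical p-value p_S

As a worked example under these assumptions: with an only suggestive original result (estimate 0.6, standard error 0.25, two-sided p of about 0.016) and a nominally significant replication (estimate 0.4, standard error 0.15), sceptical_p_value(0.6, 0.25, 0.4, 0.15) comes out near 0.12 in this sketch, illustrating the point made above that replication success is hard to achieve when the original evidence is only suggestive.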
