Statistical methods for replicability assessment

Large-scale replication studies like the Reproducibility Project: Psychology (RP:P) provide invaluable systematic data on scientific replicability, but most analyses and interpretations of the data fail to agree on a definition of "replicability" and to disentangle the inevitable consequences of known selection bias from competing explanations. We discuss three concrete definitions of replicability based on (1) whether published findings about the signs of effects are mostly correct, (2) how effectively replication studies reproduce whatever true effect size was present in the original experiment, and (3) whether true effect sizes tend to diminish in replication. We apply techniques from multiple testing and post-selection inference to develop new methods that answer these questions while explicitly accounting for selection bias; these methods make no distributional assumptions about the true effect sizes. Our analyses suggest that the RP:P data are largely consistent with publication bias arising from the selection of significant results.
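To illustrate the kind of selection adjustment the abstract describes, here is a minimal sketch of post-selection inference under a simple publication model: a study's z-statistic is observed only when it clears the two-sided 5% significance threshold, so the effect estimate must be based on the truncated-normal likelihood rather than the raw likelihood. The selection model, function names, and single-study setup are illustrative assumptions, not the paper's actual estimators.

```python
# Sketch: selection-adjusted estimate of a true effect size, assuming a
# result is published only when its z-statistic satisfies |z| > 1.96.
# Illustrative only; this is not the paper's estimator.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

Z_CRIT = 1.96  # two-sided 5% significance threshold (assumed selection rule)

def conditional_nll(mu, z_obs):
    """Negative log-likelihood of z_obs ~ N(mu, 1) conditional on selection."""
    log_density = norm.logpdf(z_obs, loc=mu)
    # Probability that a study with true mean mu gets published at all
    p_select = norm.sf(Z_CRIT - mu) + norm.cdf(-Z_CRIT - mu)
    return -(log_density - np.log(p_select))

def selection_adjusted_mle(z_obs):
    """Maximize the truncated-normal likelihood over the true mean mu."""
    res = minimize_scalar(conditional_nll, args=(z_obs,),
                          bounds=(-10.0, 10.0), method="bounded")
    return res.x

naive = 2.5                      # published z-statistic, taken at face value
adjusted = selection_adjusted_mle(naive)
# Conditioning on significance pulls the estimate toward zero, reflecting
# the "winner's curse": naive estimates of selected effects are inflated.
```

The same conditional-likelihood idea underlies confidence intervals with post-selection coverage: one inverts tests based on the truncated distribution instead of the unconditional one.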
