Ignored evident multiplicity harms replicability -- adjusting for it offers a remedy

It is a central dogma in science that the results of a study should be replicable. Yet only 90 of the 190 replication attempts were successful. We attribute a substantial part of the problem to selective inference that is evident in the papers themselves: the practice of selecting some of the results from the many that are examined. We analyzed the 100 papers of the Reproducibility Project: Psychology and found that reporting many results per paper is common (77.7 on average), yet the selection from those multiple results is not adjusted for. We propose to account for selection using the hierarchical false discovery rate (FDR) controlling procedure TreeBH of Bogomolov et al. (2020), which exploits hierarchical structures to gain power. After adjustment, 97% of the replicable results (31 of 32) remained statistically significant, whereas only 1 of the 21 results that were not significant after adjustment was replicated. Given the easy deployment of adjustment tools and the minor loss of power involved, we argue that addressing multiplicity is an essential component missing from experimental psychology, and it should become a required part of the arsenal of replicability-enhancing methodologies in the field.
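To make the adjustment concrete, below is a minimal two-level sketch of the hierarchical idea that TreeBH builds on: screen papers (families of results) first, then test within the selected papers at a shrunken level, as in Benjamini et al.'s selective inference on multiple families of hypotheses [28]. It is an illustration only, not the authors' TreeBH implementation; the function names, the input format (a mapping from paper to its p-values), and the example data are hypothetical, and it assumes the Python packages numpy and statsmodels.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests


def simes(pvals):
    # Simes combination: min over i of p_(i) * m / i, used here as the
    # paper-level (family-level) p-value.
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    return float(np.min(p * m / np.arange(1, m + 1)))


def two_level_fdr(papers, q=0.05):
    """Two-level hierarchical FDR sketch (hypothetical helper, not TreeBH itself).

    `papers` maps a paper id to an array of that paper's p-values.
    Returns, for each paper selected at the first level, a boolean array
    marking which of its results are discoveries at the second level.
    """
    ids = list(papers)
    # Level 1: screen papers by applying BH at level q to the Simes-combined
    # p-value of each paper's family of results.
    parent_p = np.array([simes(papers[i]) for i in ids])
    parent_reject = multipletests(parent_p, alpha=q, method="fdr_bh")[0]
    selected = [i for i, rej in zip(ids, parent_reject) if rej]
    if not selected:
        return {}
    # Level 2: within each selected paper, apply BH at the shrunken level
    # q * (#selected papers) / (#papers), which is what keeps the overall
    # error rate controlled in the two-level procedure.
    q_child = q * len(selected) / len(ids)
    return {
        i: multipletests(papers[i], alpha=q_child, method="fdr_bh")[0]
        for i in selected
    }


# Hypothetical example: three "papers", each reporting ten results.
rng = np.random.default_rng(1)
example = {
    "paper_A": rng.uniform(size=10),                       # noise only
    "paper_B": np.r_[0.0001, 0.002, rng.uniform(size=8)],  # two strong effects
    "paper_C": rng.uniform(size=10),                       # noise only
}
print(two_level_fdr(example, q=0.05))
```

The shrunken within-paper level (q times the fraction of papers selected) is what makes the procedure hierarchical: a paper's individual results are tested only if the paper itself survives the first-level screening, which is one reason the loss of power from the adjustment can remain small.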

[1] Christine B. Peterson, et al. Hypotheses on a tree: new error rates and testing strategies, 2020, Biometrika.

[2] Maya B. Mathur, et al. Many Labs 5: Testing Pre-Data-Collection Peer Review as an Intervention to Increase Replicability, 2019, Advances in Methods and Practices in Psychological Science.

[3] National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science, 2019.

[4] Sander Greenland, et al. Scientists rise up against statistical significance, 2019, Nature.

[5] A. Gelman, et al. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time, 2019.

[6] Reginald B. Adams, et al. Many Labs 2: Investigating Variation in Replicability Across Sample and Setting, 2018.

[7] Deborah G. Mayo, et al. Statistical Inference as Severe Testing, 2018.

[8] Brian A. Nosek, et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015, 2018, Nature Human Behaviour.

[9] Sebastian Galiani, et al. How to make replication the norm, 2018, Nature.

[10] Robbie C. M. van Aert, et al. Examining reproducibility in psychology: A hybrid method for combining a statistically significant original study and a replication, 2017, Behavior Research Methods.

[11] Monica Driscoll, et al. A long journey to reproducible results, 2017, Nature.

[12] Michael C. Frank, et al. A Collaborative Approach to Infant Research: Promoting Reproducibility, Best Practices, and Theory-Building, 2017, Infancy: The Official Journal of the International Society on Infant Studies.

[13] Yoav Benjamini, et al. Addressing reproducibility in single-laboratory phenotyping experiments, 2017, Nature Methods.

[14] Robbie C. M. van Aert, et al. Bayesian evaluation of effect size after replicating an original study, 2017, PLoS ONE.

[15] Brian A. Nosek, et al. Many Labs 3: Evaluating participant pool quality across the academic semester via replication, 2016.

[16] Dong Kyu Lee. Alternatives to P value: confidence interval and effect size, 2016, Korean Journal of Anesthesiology.

[17] Wolfgang Stroebe, et al. Are most published social psychological findings false?, 2016.

[18] John P. A. Ioannidis, et al. What does research reproducibility mean?, 2016, Science Translational Medicine.

[19] M. Baker. 1,500 scientists lift the lid on reproducibility, 2016, Nature.

[20] N. Lazar, et al. The ASA Statement on p-Values: Context, Process, and Purpose, 2016.

[21] Gideon Nave, et al. Evaluating replicability of laboratory experiments in economics, 2016, Science.

[22] Timothy D. Wilson, et al. Comment on “Estimating the reproducibility of psychological science”, 2016, Science.

[23] W. Vanpaemel, et al. Are We Wasting a Good Crisis? The Availability of Psychological Research Data after the Storm, 2015.

[24] Michael C. Frank, et al. Estimating the reproducibility of psychological science, 2015, Science.

[25] Victoria Savalei, et al. Is the call to abandon p-values the red herring of the replicability crisis?, 2015, Front. Psychol.

[26] Jonathan W. Schooler, et al. Metascience could rescue the ‘replication crisis’, 2014, Nature.

[27] M. Donnellan, et al. Does Cleanliness Influence Moral Judgments?, 2014.

[28] Yoav Benjamini, et al. Selective inference on multiple families of hypotheses, 2014.

[29] Brian A. Nosek, et al. Power failure: why small sample size undermines the reliability of neuroscience, 2013, Nature Reviews Neuroscience.

[30] J. P. Simmons, et al. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant, 2011, Psychological Science.

[31] G. Loewenstein, et al. Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling, 2012, Psychological Science.

[32] R. Peng. Reproducible Research in Computational Science, 2011, Science.

[33] F. Prinz, et al. Believe it or not: how much can we rely on published data on potential drug targets?, 2011, Nature Reviews Drug Discovery.

[34] Linda M. Collins, et al. Replication in Prevention Science, 2011, Prevention Science.

[35] D. Fanelli. How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data, 2009, PLoS ONE.

[36] Y. Benjamini, et al. Screening for Partial Conjunction Hypotheses, 2008, Biometrics.

[37] Anat Sakov, et al. Genotype-environment interactions in mouse behavior: a way out of the problem, 2005, Proceedings of the National Academy of Sciences of the United States of America.

[38] Y. Benjamini, et al. False Discovery Rate–Adjusted Multiple Confidence Intervals for Selected Parameters, 2005.

[39] Yoav Benjamini, et al. Identifying differentially expressed genes using false discovery rate controlling procedures, 2003, Bioinformatics.

[40] N. L. Kerr. HARKing: Hypothesizing After the Results are Known, 1998, Personality and Social Psychology Review.

[41] Susan R. Homack. Understanding What ANOVA Post Hoc Tests Are, Really, 2001.

[42] Francis Tuerlinckx, et al. Type S error rates for classical and Bayesian single and multiple comparison procedures, 2000, Comput. Stat.

[43] J. Crabbe, et al. Genetics of mouse behavior: interactions with laboratory environment, 1999, Science.

[44] Y. Benjamini, et al. Controlling the false discovery rate: a practical and powerful approach to multiple testing, 1995.

[45] Joel B. Greenhouse, et al. Selection Models and the File Drawer Problem, 1988.

[46] R. Simes, et al. An improved Bonferroni procedure for multiple tests of significance, 1986.

[47] R. Rosenthal. The file drawer problem and tolerance for null results, 1979.

[48] Nathaniel C. Smith. Replication studies: A neglected aspect of psychological research, 1970.

[49] R. Fisher. The Fiducial Argument in Statistical Inference, 1935.

[50] R. Fisher, et al. The Logic of Inductive Inference, 1935.