Dataset Decay: the problem of sequential analyses on open datasets

Open data has two principal uses: (i) to reproduce original findings and (ii) to allow researchers to ask new questions of existing data. The latter enables discoveries by bringing a more diverse set of viewpoints and hypotheses to the data, which is self-evidently advantageous for the progress of science. However, if many researchers reuse the same dataset, multiple statistical testing may increase false positives in the literature. Current practice suggests that the number of tests to be corrected for is the number of simultaneous tests performed by a single researcher. Here we demonstrate that sequential hypothesis testing on the same dataset by multiple researchers can also inflate error rates. This finding is troubling because, as more researchers embrace an open dataset, the likelihood of false positives (i.e. type I errors) will increase. Thus, we should expect a dataset's utility for discovering new true relations between variables to decay. We consider several sequential correction procedures. These solutions can reduce the number of false positives but, at the same time, can prompt undesired challenges to open data (e.g. incentivising restricted access).
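
To make the inflation concrete, here is a minimal simulation sketch, not taken from the paper: the sample size, number of simulations, and the choice of a simple Bonferroni-style split of the α budget across all sequential tests are illustrative assumptions. It estimates the probability of at least one false positive when k researchers each run one (null) test on the same shared dataset, with and without a sequential correction.

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(0)

def familywise_error(k, alpha=0.05, n=100, n_sims=2000, correct=False):
    """Estimate P(at least one false positive) when k researchers each
    run one correlation test on the same null dataset."""
    hits = 0
    for _ in range(n_sims):
        y = rng.standard_normal(n)       # shared outcome in the open dataset
        X = rng.standard_normal((n, k))  # k candidate predictors, all null
        any_sig = False
        for j in range(k):  # researchers analyse the data one after another
            r = np.corrcoef(y, X[:, j])[0, 1]
            # Standard t-test for a Pearson correlation with n samples.
            t = r * np.sqrt((n - 2) / (1 - r * r))
            p = 2 * t_dist.sf(abs(t), df=n - 2)
            # Illustrative correction: split alpha over all sequential tests.
            thresh = alpha / k if correct else alpha
            any_sig = any_sig or (p < thresh)
        hits += any_sig
    return hits / n_sims

for k in (1, 5, 20):
    print(k, familywise_error(k), familywise_error(k, correct=True))
# Uncorrected rates approach 1 - (1 - 0.05)**k; corrected stay near 0.05.
```

Dividing α by the total number of tests ever run is the crudest sequential correction and requires knowing that total in advance; schemes such as α-investing instead adapt the remaining α budget as results accrue.
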
