Test data reuse for evaluation of adaptive machine learning algorithms: over-fitting to a fixed 'test' dataset and a potential solution

After the initial release of a machine learning algorithm, the subsequently gathered data can be used to augment the training dataset in order to modify or fine-tune the algorithm. For an algorithm performance evaluation that generalizes to a targeted population of cases, test datasets randomly drawn from that population should ideally be used. To ensure that test results generalize to new data, the algorithm needs to be evaluated on new and independent test data each time a new performance evaluation is required. However, medical test datasets of sufficient quality are often hard to acquire, and it is tempting to utilize a previously used test dataset for a new performance evaluation. With extensive simulation studies, we illustrate how such a "naive" approach to test data reuse can inadvertently result in overfitting the algorithm to the test data, even when only a global performance metric is reported back from the test dataset. This overfitting leads to a loss of generalization and overly optimistic conclusions about the algorithm's performance. We investigate the Thresholdout method of Dwork et al. [1] as a way to tackle this problem. Thresholdout allows repeated reuse of the same test dataset: it essentially reports a noisy version of the performance metric computed on the test data, and it provides theoretical guarantees on how many times the test dataset can be accessed while ensuring that the reported answers generalize to the underlying distribution. Our simulation studies further show that Thresholdout indeed substantially reduces overfitting to the test data under the simulated conditions, at the cost of mild additional uncertainty in the reported test performance. We also extend some of the theoretical guarantees to the area under the ROC curve as the reported performance metric.
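To make the mechanism concrete, below is a minimal Python sketch of a single Thresholdout query, intended for illustration only: the function name, the default threshold and noise scale, and the omission of the per-dataset query budget described in [1] are our own simplifications, and the exact Laplace noise scales vary slightly across published variants of the algorithm.

import numpy as np

def thresholdout_query(train_stat, test_stat, threshold=0.04, sigma=0.01, rng=None):
    """Answer one adaptive query (e.g., an empirical AUC) in the style of Thresholdout.

    Report the training-set value of the statistic unless it differs from the
    test-set value by more than a noise-perturbed threshold, in which case a
    Laplace-noised test-set value is reported instead.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Perturb the threshold with Laplace noise so that the comparison itself
    # leaks little information about the test set.
    gamma = rng.laplace(scale=2.0 * sigma)
    if abs(train_stat - test_stat) > threshold + gamma:
        # The statistic appears to overfit the training data: answer from the
        # test set, with fresh noise added to limit what each query reveals.
        return test_stat + rng.laplace(scale=sigma)
    # Training and test values agree: report the training value and leave the
    # test set untouched by this query.
    return train_stat

In an evaluation workflow, train_stat and test_stat would be the same performance metric (for example, the empirical AUC of the current model) computed on the training and test datasets, and only the value returned by the mechanism, never the raw test-set metric, would be reported back to the algorithm developer.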

[1] C. Dwork et al. Preserving Statistical Validity in Adaptive Data Analysis. STOC, 2015.

[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011.

[3] R. Cummings et al. Adaptive Learning with Robust Generalization Guarantees. COLT, 2016.

[4] K. J. Myers et al. Evaluating imaging and computer-aided detection and diagnosis devices at the FDA. Academic Radiology, 2012.

[5] K. H. Maier-Hein et al. Tractography-based connectomes are dominated by false-positive connections. bioRxiv, 2016.

[6] G. Varoquaux. Cross-validation failure: Small sample sizes lead to large error bars. NeuroImage, 2017.

[7] T. Hastie et al. The Elements of Statistical Learning. Springer, 2001.

[8] A. Blum and M. Hardt. The Ladder: A Reliable Leaderboard for Machine Learning Competitions. ICML, 2015.

[9] D. Russo and J. Zou. Controlling Bias in Adaptive Data Analysis Using Information Theory. AISTATS, 2016.

[10] R. Bassily et al. Algorithmic stability for adaptive data analysis. STOC, 2016.

[11] H.-M. Hsueh et al. Comparison of Methods for Estimating the Number of True Null Hypotheses in Multiplicity Testing. Journal of Biopharmaceutical Statistics, 2003.

[12] J. Reunanen. Overfitting in Making Comparisons Between Variable Selection Methods. Journal of Machine Learning Research, 2003.

[13] C. Dwork et al. The reusable holdout: Preserving validity in adaptive data analysis. Science, 2015.

[14] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 1995.

[15] Y. Freund and R. E. Schapire. Experiments with a New Boosting Algorithm. ICML, 1996.

[16] A. Gelman and E. Loken. Data-dependent analysis, a "garden of forking paths," explains why many statistically significant comparisons don't hold up. 2014.

[17] M. Kuhn and K. Johnson. Applied Predictive Modeling. Springer, 2013.

[18] J. P. A. Ioannidis. Why most published research findings are false. PLoS Medicine, 2005.

[19] G. C. Cawley and N. L. C. Talbot. On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 2010.

[20] H. Bowman et al. I Tried a Bunch of Things: The Dangers of Unexpected Overfitting in Classification. bioRxiv, 2016.

[21] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 2005.

[22] E. Vul et al. Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition. Perspectives on Psychological Science, 2009.

[23] C. Dwork et al. Generalization in Adaptive Data Analysis and Holdout Reuse. NIPS, 2015.

[24] D. Szucs and J. P. A. Ioannidis. Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biology, 2017.

[25] A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 1997.

[26] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1972.

[27] A. D. Smith. Information, Privacy and Stability in Adaptive Data Analysis. arXiv preprint, 2017.

[28] A. Eklund et al. Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences, 2016.

[29] L. Breiman. Random Forests. Machine Learning, 2001.

[30] R. B. Rao et al. On the Dangers of Cross-Validation: An Experimental Evaluation. SDM, 2008.

[31] R. F. Wagner et al. Classifier design for computer-aided diagnosis: effects of finite sample size on the mean performance of classical and neural network classifiers. Medical Physics, 1999.

[32] J. Friedman et al. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 2010.

[33] N. Kriegeskorte et al. Circular analysis in systems neuroscience: the dangers of double dipping. Nature Neuroscience, 2009.

[34] C. Dwork et al. Guilt-free data reuse. Communications of the ACM, 2017.

[35] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982.

[36] V. Feldman and T. Steinke. Generalization for Adaptively-chosen Estimators via Stable Median. COLT, 2017.

[37] M. Kuhn. Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 2008.