Accounting for Variance in Machine Learning Benchmarks

Strong empirical evidence that one machine-learning algorithm A outperforms another B ideally calls for multiple trials that optimize the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization, and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in the light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator brings it closer to the ideal estimator, at a 51× reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
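
To make the randomized estimator concrete, below is a minimal Python sketch, not the paper's code: `train_and_eval` is a hypothetical placeholder for a full training pipeline, and the seed names (`data_seed`, `init_seed`, `hopt_seed`) are illustrative. Each trial re-draws every source of variation, so the spread of the resulting scores reflects variance that a fixed-seed benchmark would hide.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_and_eval(data_seed, init_seed, hopt_seed):
    """Hypothetical placeholder: a real pipeline would resample the data
    split, reinitialize weights, and rerun hyperparameter search with the
    given seeds, then return test performance."""
    local = np.random.default_rng([data_seed, init_seed, hopt_seed])
    return 0.90 + local.normal(scale=0.01)

def randomized_benchmark(n_trials=10):
    """Randomize *all* sources of variation on every trial, rather than
    fixing the data split and initialization across trials."""
    scores = np.array([
        train_and_eval(
            data_seed=int(rng.integers(2**31)),  # data sampling / split
            init_seed=int(rng.integers(2**31)),  # parameter initialization
            hopt_seed=int(rng.integers(2**31)),  # hyperparameter choice
        )
        for _ in range(n_trials)
    ])
    return scores.mean(), scores.std(ddof=1)

mean_score, std_score = randomized_benchmark()
print(f"estimated performance: {mean_score:.3f} +/- {std_score:.3f}")
```

Comparing two pipelines then amounts to asking whether the difference in their mean scores is large relative to this estimated variability, rather than relying on a single run with fixed seeds.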
