Accounting for Variance in Machine Learning Benchmarks

Strong empirical evidence that one machine-learning algorithm A outperforms another B ideally calls for multiple trials that optimize the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization, and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in the light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator brings it closer to the ideal estimator, at a 51× reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
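
To make the randomized estimator concrete, below is a minimal Python sketch, not the paper's code: `train_and_eval` is a hypothetical placeholder for a full training pipeline, and the seed names (`data_seed`, `init_seed`, `hopt_seed`) are illustrative. Each trial re-draws every source of variation, so the spread of the resulting scores reflects variance that a fixed-seed benchmark would hide.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_and_eval(data_seed, init_seed, hopt_seed):
    """Hypothetical placeholder: a real pipeline would resample the data
    split, reinitialize weights, and rerun hyperparameter search with the
    given seeds, then return test performance."""
    local = np.random.default_rng([data_seed, init_seed, hopt_seed])
    return 0.90 + local.normal(scale=0.01)

def randomized_benchmark(n_trials=10):
    """Randomize *all* sources of variation on every trial, rather than
    fixing the data split and initialization across trials."""
    scores = np.array([
        train_and_eval(
            data_seed=int(rng.integers(2**31)),  # data sampling / split
            init_seed=int(rng.integers(2**31)),  # parameter initialization
            hopt_seed=int(rng.integers(2**31)),  # hyperparameter choice
        )
        for _ in range(n_trials)
    ])
    return scores.mean(), scores.std(ddof=1)

mean_score, std_score = randomized_benchmark()
print(f"estimated performance: {mean_score:.3f} +/- {std_score:.3f}")
```

Comparing two pipelines then amounts to asking whether the difference in their mean scores is large relative to this estimated variability, rather than relying on a single run with fixed seeds.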
