Paper information - Survey of machine-learning experimental methods at NeurIPS2019 and ICLR2020

How do machine-learning researchers run their empirical validation? In the context of a push for improved reproducibility and benchmarking, this question matters for developing new tools for model comparison. This document summarizes a simple survey about experimental procedures, sent to authors of papers published at two leading conferences, NeurIPS 2019 and ICLR 2020. It gives a simple picture of how hyper-parameters are set, how many baselines and datasets are included, and how seeds are used.

Xavier Bouthillier | Gaël Varoquaux
