Optimizer Benchmarking Needs to Account for Hyperparameter Tuning

The performance of optimizers, particularly in deep learning, depends considerably on the chosen hyperparameter configuration. The efficacy of optimizers is often studied under near-optimal, problem-specific hyperparameters, yet finding these settings may be prohibitively costly for practitioners. In this work, we argue that a fair assessment of optimizers must take the computational cost of hyperparameter tuning into account, i.e., how easy it is to find good hyperparameter configurations using an automatic hyperparameter search. Evaluating a variety of optimizers on an extensive set of standard datasets and architectures, we find that Adam is the most practical choice, particularly in low-budget scenarios.
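
To make the evaluation protocol concrete, the sketch below scores each optimizer by the best result it reaches after a fixed hyperparameter-search budget, rather than under hand-picked settings. This is a minimal illustrative sketch, not the paper's benchmark: the noisy quadratic toy objective, the log-uniform learning-rate prior, and the budgets of 1, 4, and 16 random-search trials are all assumed here for illustration.

```python
# Minimal sketch (assumed setup, not the paper's benchmark): compare optimizers
# by the expected best loss reachable within a given hyperparameter-tuning budget.
import numpy as np

rng = np.random.default_rng(0)
SCALES = np.array([1.0, 10.0, 100.0])  # ill-conditioned toy quadratic

def loss(w):
    """Quadratic objective 0.5 * sum_i scale_i * w_i^2."""
    return 0.5 * np.sum(SCALES * w ** 2)

def noisy_grad(w):
    """Exact gradient plus Gaussian noise, mimicking stochastic minibatch gradients."""
    return SCALES * w + rng.normal(scale=1.0, size=w.shape)

def run_sgd(lr, steps=200):
    w = np.ones(3)
    for _ in range(steps):
        # Clipping keeps diverging runs finite instead of overflowing.
        w = np.clip(w - lr * noisy_grad(w), -1e6, 1e6)
    return loss(w)

def run_adam(lr, steps=200, b1=0.9, b2=0.999, eps=1e-8):
    w = np.ones(3)
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = noisy_grad(w)
        m = b1 * m + (1 - b1) * g            # first-moment estimate
        v = b2 * v + (1 - b2) * g ** 2       # second-moment estimate
        m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)  # bias correction
        w = np.clip(w - lr * m_hat / (np.sqrt(v_hat) + eps), -1e6, 1e6)
    return loss(w)

def tuned_performance(run_fn, budget, repeats=50):
    """Expected best loss after random search over the learning rate with
    `budget` sampled configurations, averaged over independent tuning runs."""
    best = []
    for _ in range(repeats):
        lrs = 10.0 ** rng.uniform(-5, 0, size=budget)  # log-uniform prior over lr
        best.append(min(run_fn(lr) for lr in lrs))
    return float(np.mean(best))

for budget in (1, 4, 16):
    print(f"budget={budget:>2}  "
          f"SGD best loss: {tuned_performance(run_sgd, budget):.3f}  "
          f"Adam best loss: {tuned_performance(run_adam, budget):.3f}")
```

Under a protocol of this kind, an optimizer that performs reasonably across a wide range of learning rates looks strongest at small budgets, which is the sense in which tuning cost becomes part of the comparison rather than an afterthought.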
