Taming Hyper-parameters in Deep Learning Systems

Deep learning (DL) systems expose many tuning parameters ("hyper-parameters") that affect the performance and accuracy of trained models. Increasingly, users struggle to configure hyper-parameters, and a substantial portion of training time is spent tuning them empirically. We argue that future DL systems should be designed to help manage hyper-parameters. We describe how a distributed DL system can (i) remove the impact of hyper-parameters on both performance and accuracy, thus making it easier to decide on a good setting, and (ii) support more powerful dynamic policies for adapting hyper-parameters, which take monitored training metrics into account. We report results from prototype implementations that show the practicality of DL system designs that are hyper-parameter-friendly.
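
To make point (ii) concrete, the sketch below shows one plausible shape of a "dynamic policy" that adapts a hyper-parameter from a monitored training metric; it is an illustrative assumption, not the system described here, and all names (DynamicLRPolicy, patience, decay) are hypothetical.

```python
# Minimal sketch (assumed, not the paper's implementation) of a dynamic
# hyper-parameter policy: it watches a monitored training metric each epoch
# and decays the learning rate when the metric stops improving.

class DynamicLRPolicy:
    """Reduce the learning rate when the monitored metric plateaus."""

    def __init__(self, initial_lr=0.1, decay=0.5, patience=2, min_delta=1e-4):
        self.lr = initial_lr        # current learning-rate setting
        self.decay = decay          # multiplicative decay factor
        self.patience = patience    # epochs to wait before decaying
        self.min_delta = min_delta  # minimum improvement that counts
        self.best = float("inf")    # best metric value seen so far
        self.stale_epochs = 0       # epochs without sufficient improvement

    def update(self, monitored_metric):
        """Called once per epoch with e.g. the validation loss; returns the
        (possibly adapted) learning rate to use for the next epoch."""
        if monitored_metric < self.best - self.min_delta:
            self.best = monitored_metric
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.decay
                self.stale_epochs = 0
        return self.lr


# Usage sketch: the training loop reports a monitored metric each epoch
# and applies whatever learning rate the policy returns.
policy = DynamicLRPolicy(initial_lr=0.1, patience=2)
for epoch, val_loss in enumerate([0.9, 0.7, 0.69, 0.69, 0.69, 0.69]):
    lr = policy.update(val_loss)
    print(f"epoch {epoch}: val_loss={val_loss:.2f} -> lr={lr:.4f}")
```

A hyper-parameter-friendly DL system would let such policies observe training metrics online and apply the resulting hyper-parameter changes without restarting or destabilising the distributed training job.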
