Training Deep Neural Networks

The procedure for training neural networks with backpropagation is briefly introduced in Chapter 1. This chapter expands on that description in several ways.
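To make the backpropagation procedure referred to above concrete, the following is a minimal sketch (not taken from the chapter) of one training loop for a tiny two-layer network with a sigmoid hidden layer and squared loss; all names (W1, W2, lr, and so on) are illustrative choices rather than notation from the text.

```python
# Minimal illustrative sketch of backpropagation + gradient descent
# on a toy two-layer network. Hypothetical example, not the chapter's code.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 examples, 3 input features, 1 target value each.
X = rng.normal(size=(4, 3))
y = rng.normal(size=(4, 1))

# Randomly initialized weights of a 3 -> 5 -> 1 network.
W1 = rng.normal(scale=0.1, size=(3, 5))
W2 = rng.normal(scale=0.1, size=(5, 1))
lr = 0.1  # learning rate (step size)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(100):
    # Forward pass: compute activations layer by layer.
    h = sigmoid(X @ W1)                     # hidden activations
    y_hat = h @ W2                          # linear output layer
    loss = 0.5 * np.mean((y_hat - y) ** 2)  # squared-error loss

    # Backward pass: propagate the loss gradient to each weight matrix.
    d_out = (y_hat - y) / X.shape[0]        # dL/dy_hat
    grad_W2 = h.T @ d_out                   # dL/dW2
    d_h = (d_out @ W2.T) * h * (1 - h)      # chain rule through the sigmoid
    grad_W1 = X.T @ d_h                     # dL/dW1

    # Gradient-descent update of both layers.
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

print(f"final loss: {loss:.4f}")
```

The chapter's later discussion of learning-rate schedules, initialization, and second-order methods can be read as refinements of the simple update `W -= lr * grad_W` shown here.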
