On the importance of initialization and momentum in deep learning

Deep and recurrent neural networks (DNNs and RNNs, respectively) are powerful models that were long considered almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial: poorly initialized networks cannot be trained with momentum, and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned. Our success in training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives, without the need for sophisticated second-order methods.
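To make the ingredients concrete, the sketch below (plain NumPy; the function names, parameter values, and the toy objective are illustrative, not from the paper) implements the update rules at issue. Classical momentum computes v <- mu*v - lr*grad f(theta), theta <- theta + v, while the Nesterov variant favored in the paper evaluates the gradient at the look-ahead point theta + mu*v. The momentum schedule mu_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), mu_max) and the sparse initialization (roughly 15 nonzero Gaussian incoming weights per unit, after Martens, 2010) follow the paper; the Gaussian scale and learning rate here are placeholders.

    import numpy as np

    def sparse_init(n_in, n_out, n_conn=15, scale=1.0, rng=None):
        # Sparse initialization in the spirit of Martens (2010): each unit
        # receives n_conn nonzero incoming weights drawn from N(0, scale^2);
        # all remaining weights are exactly zero.
        rng = np.random.default_rng() if rng is None else rng
        W = np.zeros((n_in, n_out))
        for j in range(n_out):
            idx = rng.choice(n_in, size=min(n_conn, n_in), replace=False)
            W[idx, j] = rng.normal(0.0, scale, size=len(idx))
        return W

    def momentum_schedule(t, mu_max=0.99):
        # Slowly increasing schedule from the paper:
        # mu_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), mu_max)
        return min(1.0 - 2.0 ** (-1.0 - np.log2(t // 250 + 1)), mu_max)

    def nag_step(theta, v, grad_fn, t, lr=0.01, mu_max=0.99):
        # Nesterov's accelerated gradient in momentum form:
        #   v     <- mu*v - lr * grad f(theta + mu*v)   (look-ahead gradient)
        #   theta <- theta + v
        # Classical momentum would instead evaluate grad f(theta).
        mu = momentum_schedule(t, mu_max)
        v = mu * v - lr * grad_fn(theta + mu * v)
        return theta + v, v

    # Toy usage: an ill-conditioned quadratic f(x) = 0.5 * x^T A x, whose
    # narrow valley is the kind of curvature momentum is meant to handle.
    A = np.diag([1.0, 100.0])
    grad_fn = lambda x: A @ x
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for t in range(2000):
        theta, v = nag_step(theta, v, grad_fn, t, lr=0.005)

The look-ahead gradient corrects the velocity before the parameters move, which, as the paper argues, makes large momentum coefficients considerably more tolerable than under classical momentum.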
