Mollifying Networks

The optimization of deep neural networks can be more challenging than traditional convex optimization problems due to the highly non-convex nature of the loss function; e.g., it can involve pathological landscapes such as saddle surfaces that are difficult to escape for algorithms based on simple gradient descent. In this paper, we attack the problem of optimizing highly non-convex neural network objectives by starting with a smoothed -- or mollified -- objective function which becomes more complex as training proceeds. Our proposal is inspired by recent studies of continuation methods: as in curriculum methods, we begin by learning an easier (possibly convex) objective function and let it evolve during training until it eventually becomes the original, difficult-to-optimize objective. The complexity of the mollified networks is controlled by a single hyperparameter that is annealed during training. We show improvements on various difficult optimization tasks and establish a relationship between recent works on continuation methods for neural networks and mollifiers.
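The continuation idea can be illustrated with a minimal sketch: optimize a Gaussian-smoothed (mollified) surrogate of a non-convex objective and anneal the amount of smoothing toward zero, so that the final iterations are taken on the original objective. This toy example is an assumption-laden illustration of the general principle, not the paper's actual algorithm; the objective `f`, the score-function gradient estimator, and the geometric annealing schedule are all choices made here for exposition.

```python
# Minimal sketch (assumed, illustrative): continuation by Gaussian mollification
# of a 1-D non-convex objective, with the smoothing scale annealed to zero.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A toy highly non-convex objective with many local minima.
    return np.sin(5.0 * x) + 0.1 * x ** 2

def mollified_grad(x, sigma, n_samples=64):
    # Monte Carlo estimate of the gradient of the mollified objective
    # E_{eps ~ N(0,1)}[ f(x + sigma * eps) ], via the score-function identity:
    # d/dx E[f(x + sigma*eps)] = E[f(x + sigma*eps) * eps] / sigma.
    if sigma <= 0.0:
        h = 1e-4
        return (f(x + h) - f(x - h)) / (2 * h)  # plain finite-difference gradient
    eps = rng.standard_normal(n_samples)
    return np.mean(f(x + sigma * eps) * eps) / sigma

x = 2.5        # initial point
lr = 0.05      # learning rate
sigma = 2.0    # initial smoothing: large sigma gives a nearly convex surrogate
for step in range(500):
    x -= lr * mollified_grad(x, sigma)
    sigma *= 0.99  # anneal the single smoothing hyperparameter toward zero

print(f"final x = {x:.3f}, f(x) = {f(x):.3f}")
```

With a large initial `sigma` the surrogate landscape is nearly convex, so early updates move toward the broad basin; as `sigma` decays, the surrogate converges to the original objective and the iterate refines within that basin.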
