Beyond Backprop: Online Alternating Minimization with Auxiliary Variables

Despite significant recent advances in deep neural networks, training them remains challenging because of the highly non-convex nature of the objective function. State-of-the-art methods rely on error backpropagation, which suffers from several well-known issues: vanishing and exploding gradients, the inability to handle non-differentiable nonlinearities or to parallelize weight updates across layers, and biological implausibility. These limitations continue to motivate exploration of alternative training algorithms, including several recently proposed auxiliary-variable methods that break the complex nested objective function into local subproblems. However, those techniques are mainly offline (batch) methods, which limits their applicability to extremely large datasets as well as to online, continual, or reinforcement learning. The main contribution of our work is a novel online (stochastic/mini-batch) alternating minimization (AM) approach for training deep neural networks, together with the first theoretical convergence guarantees for AM in stochastic settings and promising empirical results on a variety of architectures and datasets.

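To make the auxiliary-variable idea concrete: each layer's activations are replaced by explicit variables coupled to the network through penalty terms, and training alternates updates over the resulting local subproblems, one mini-batch at a time. The sketch below illustrates this for a two-layer regression network with a quadratic penalty; it is not the paper's algorithm, and the penalty weight `rho`, the ReLU activation, the synthetic data, and the plain gradient steps on each block are assumptions made for brevity.

```python
# A minimal sketch of online alternating minimization with auxiliary variables
# for a two-layer regression network.  Illustrative only: the penalty weight
# `rho`, the ReLU activation, the synthetic data, and the plain gradient steps
# on each block are assumptions, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 10, 32, 1
W1 = rng.normal(scale=0.1, size=(d_hid, d_in))
W2 = rng.normal(scale=0.1, size=(d_out, d_hid))
rho, lr, n_batches = 1.0, 0.05, 2000

def relu(z):
    return np.maximum(z, 0.0)

for t in range(n_batches):
    # Draw a fresh mini-batch (synthetic regression data for the sketch).
    X = rng.normal(size=(16, d_in))
    y = np.sin(X.sum(axis=1, keepdims=True))

    # Auxiliary variable a1 stands in for the hidden activations; the penalized
    # batch objective is
    #   (1/2)||a1 @ W2.T - y||^2  +  (rho/2)||a1 - relu(X @ W1.T)||^2 .
    a1 = relu(X @ W1.T)            # initialize a1 by a forward pass

    # Alternate a few block updates (a1, then W2, then W1) on this batch.
    for _ in range(3):
        # (1) auxiliary activations: gradient step on both terms that touch a1
        grad_a1 = (a1 @ W2.T - y) @ W2 + rho * (a1 - relu(X @ W1.T))
        a1 = a1 - lr * grad_a1

        # (2) output weights: gradient step on the local least-squares term
        grad_W2 = (a1 @ W2.T - y).T @ a1 / len(X)
        W2 = W2 - lr * grad_W2

        # (3) input weights: only the local penalty term depends on W1
        pre = X @ W1.T
        grad_pre = rho * (relu(pre) - a1) * (pre > 0)   # ReLU subgradient
        W1 = W1 - lr * (grad_pre.T @ X) / len(X)

print("last-batch MSE:", float(np.mean((relu(X @ W1.T) @ W2.T - y) ** 2)))
```

In an online setting each mini-batch is seen once, so the auxiliary variables are re-initialized per batch rather than stored for the whole dataset; only the weights persist across batches.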