Deep Online Convex Optimization with Gated Games

Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof exploits the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout, and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that captures the gating behavior of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
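To make the gating structure concrete, the sketch below illustrates the decomposition the abstract refers to: a rectifier layer ReLU(Wx) can be written as a binary active/inactive gate, computed from the pre-activation, multiplying the output of the underlying linear map. This is an illustrative example only, not the paper's implementation; the layer sizes and function names are invented. Roughly, once the gates are held fixed, the computation of each unit is linear in its weights, which is what lets tools from online convex optimization be applied.

```python
import numpy as np

def rectifier_layer(W, x):
    """Standard rectifier (ReLU) layer: elementwise max(Wx, 0)."""
    return np.maximum(W @ x, 0.0)

def gated_linear_layer(W, x):
    """The same layer viewed as a gated linear network: a binary
    active/inactive gate (determined by the input) multiplies the
    output of the underlying linear map W @ x."""
    pre_activation = W @ x
    gate = (pre_activation > 0).astype(pre_activation.dtype)  # 0/1 gates
    return gate * pre_activation

# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
assert np.allclose(rectifier_layer(W, x), gated_linear_layer(W, x))
```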
