On Training Implicit Models

This paper focuses on training implicit models of effectively infinite depth. Previous works rely on implicit differentiation and solve for the exact gradient in the backward pass. However, is such an exact yet expensive gradient necessary for training? Motivated by this question, we propose a novel gradient estimate for implicit models, named the phantom gradient, that 1) forgoes the costly computation of the exact gradient; and 2) empirically provides an update direction preferable for training implicit models. We theoretically analyze the condition under which an ascent direction of the loss landscape can be found, and provide two specific instantiations of the phantom gradient based on damped unrolling and the Neumann series. Experiments on large-scale tasks demonstrate that these lightweight phantom gradients accelerate the backward pass in training implicit models by roughly 1.7×, and even boost performance over exact-gradient-based approaches on ImageNet.
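
To make the idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of the unrolling-based phantom gradient: the forward fixed point is computed without tracking gradients, and only k damped unrolling steps are attached on top of the equilibrium, so ordinary backpropagation through those steps yields the cheap gradient estimate instead of the exact implicit gradient. The toy layer, the naive Picard solver, and the values of k and the damping factor lam are illustrative assumptions; the Neumann-series instantiation mentioned above replaces the unrolling with a truncated series approximation of the inverse Jacobian and is analogous.

```python
import torch
import torch.nn as nn


class ImplicitLayer(nn.Module):
    """Toy implicit layer whose output is the fixed point z* = tanh(W z* + U x)."""

    def __init__(self, dim, k=5, lam=0.5):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.U = nn.Linear(dim, dim)
        self.k, self.lam = k, lam  # unrolling depth and damping factor (illustrative values)

    def f(self, z, x):
        return torch.tanh(self.W(z) + self.U(x))

    def forward(self, x, solver_steps=30):
        # Forward pass: find the equilibrium without building an autograd graph.
        # A plain Picard iteration stands in for a root-finding solver here.
        with torch.no_grad():
            z = torch.zeros_like(x)
            for _ in range(solver_steps):
                z = self.f(z, x)
        # Backward pass: attach k damped unrolling steps at the equilibrium.
        # Backpropagating through them gives the phantom gradient; no linear
        # system involving the Jacobian is solved.
        for _ in range(self.k):
            z = self.lam * self.f(z, x) + (1.0 - self.lam) * z
        return z


layer = ImplicitLayer(dim=16)
x = torch.randn(8, 16)
loss = layer(x).pow(2).mean()
loss.backward()  # parameter gradients are phantom gradients
```

In this sketch, only the k unrolled steps enter the computation graph, so the backward cost scales with k rather than with the cost of solving the implicit linear system.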
