Stabilizing Equilibrium Models by Jacobian Regularization

Deep equilibrium networks (DEQs) are a new class of models that eschews traditional depth in favor of finding the fixed point of a single nonlinear layer. These models have been shown to achieve performance competitive with state-of-the-art deep networks while using significantly less memory. Yet they are also slower, brittle to architectural choices, and introduce potential instability into the model. In this paper, we propose a regularization scheme for DEQ models that explicitly regularizes the Jacobian of the fixed-point update equations to stabilize the learning of equilibrium models. We show that this regularization adds only minimal computational cost, significantly stabilizes the fixed-point convergence in both forward and backward passes, and scales well to high-dimensional, realistic domains (e.g., WikiText-103 language modeling and ImageNet classification). Using this method, we demonstrate, for the first time, an implicit-depth model that runs at approximately the same speed and level of performance as popular conventional deep networks such as ResNet-101, while still maintaining the constant memory footprint and architectural simplicity of DEQs. Code is available here.
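To make the idea concrete, the sketch below shows one way a Jacobian penalty of this kind can be estimated with a stochastic (Hutchinson-style) trace estimator using only vector-Jacobian products from automatic differentiation. This is a minimal illustration under assumptions: the function name `jacobian_reg`, the weight `gamma`, the single-probe default, and the normalization by the output size are illustrative choices, not the paper's exact implementation.

```python
import torch

def jacobian_reg(f_out, z, num_probes=1):
    """Hutchinson-style estimate of ||J_f(z)||_F^2 via vector-Jacobian products.

    f_out = f(z, x) must be computed with z requiring gradients; no explicit
    Jacobian matrix is ever formed. Names and normalization are illustrative.
    """
    est = 0.0
    for _ in range(num_probes):
        eps = torch.randn_like(f_out)  # Gaussian probe vector
        # eps^T J_f(z), obtained from autograd as a vector-Jacobian product
        vjp = torch.autograd.grad(f_out, z, grad_outputs=eps, create_graph=True)[0]
        est = est + vjp.pow(2).sum() / f_out.numel()
    return est / num_probes

# Illustrative usage inside a DEQ training step (z_star: approximate fixed point):
# z_star = z_star.detach().requires_grad_(True)
# f_out = f(z_star, x)                      # one extra evaluation of the layer
# loss = task_loss + gamma * jacobian_reg(f_out, z_star)
```

Because the estimator relies only on vector-Jacobian products, its cost is roughly one additional backward pass per probe, which is consistent with the claim that the regularization adds only minimal computational overhead.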
