Stabilizing Equilibrium Models by Jacobian Regularization