The Lipschitz Constant of Self-Attention

Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising the training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz for an unbounded input domain, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of the theory, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.
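The contrast between the two attention variants can be made concrete with a short sketch. The snippet below is a minimal illustration, not the authors' code: it compares standard scaled dot-product self-attention with a plausible L2 variant in which the query-key dot product is replaced by a negative squared Euclidean distance and the query and key projections are tied (W_Q = W_K); the paper's actual multi-head formulation, scaling, and value path may differ in detail, and all names here are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(X, W_Q, W_K, W_V):
    """Standard single-head scaled dot-product self-attention."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # similarity = scaled dot product
    return softmax(scores, axis=-1) @ V

def l2_attention(X, W_Q, W_V):
    """Sketch of L2 self-attention with tied query/key projection (W_K = W_Q)."""
    Q = X @ W_Q                                    # queries and keys share one projection
    d2 = ((Q[:, None, :] - Q[None, :, :]) ** 2).sum(-1)  # pairwise squared L2 distances
    scores = -d2 / np.sqrt(Q.shape[-1])            # similarity = negative scaled distance
    return softmax(scores, axis=-1) @ (X @ W_V)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, D = 5, 8                                    # sequence length, model width
    X = rng.normal(size=(N, D))
    W_Q, W_K, W_V = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3)]
    print(dot_product_attention(X, W_Q, W_K, W_V).shape)  # (5, 8)
    print(l2_attention(X, W_Q, W_V).shape)                # (5, 8)
```

In the dot-product case the attention scores Q K^T can grow without bound with the inputs, which is what breaks the Lipschitz property on an unbounded domain; in the L2 variant the scores enter the softmax as negative squared distances, the form for which the paper establishes Lipschitz continuity and an upper bound on the constant.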
