Sparse and Continuous Attention Mechanisms

Exponential families are widely used in machine learning; they include many distributions over continuous and discrete domains (e.g., Gaussian, Dirichlet, Poisson, and categorical distributions via the softmax transformation). Distributions in each of these families have fixed support. In contrast, for finite domains, recent work has proposed sparse alternatives to softmax (e.g., sparsemax and alpha-entmax), which have varying support and can assign exactly zero probability to irrelevant categories. This paper expands that line of work in two directions: first, we extend alpha-entmax to continuous domains, revealing a link with Tsallis statistics and deformed exponential families. Second, we introduce continuous-domain attention mechanisms, deriving efficient gradient backpropagation algorithms for alpha in {1, 2}. Experiments on attention-based text classification, machine translation, and visual question answering illustrate the use of continuous attention in 1D and 2D, showing that it allows attending to time intervals and compact regions.
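In the finite-domain case, the sparsemax transformation mentioned above is the Euclidean projection of the score vector onto the probability simplex, which is what produces exact zeros; alpha-entmax interpolates between softmax (alpha = 1) and sparsemax (alpha = 2). Below is a minimal NumPy sketch of the sparsemax projection using the standard sorting-based closed form; it is an illustrative implementation for this summary, not code from the paper.

    import numpy as np

    def sparsemax(z):
        """Euclidean projection of a score vector z onto the probability simplex.

        Unlike softmax, the output can contain exact zeros for low-scoring
        entries (Martins & Astudillo, 2016). Sorting-based closed form, O(K log K).
        """
        z = np.asarray(z, dtype=float)
        z_sorted = np.sort(z)[::-1]               # scores in decreasing order
        k = np.arange(1, z.size + 1)
        cssv = np.cumsum(z_sorted)
        support = k[1.0 + k * z_sorted > cssv]    # indices kept in the support
        k_star = support[-1]
        tau = (cssv[k_star - 1] - 1.0) / k_star   # threshold subtracted from all scores
        return np.maximum(z - tau, 0.0)

    # toy check: a dominant score takes all the mass, close scores share it
    print(sparsemax([1.5, 0.3, -2.0]))   # -> [1. 0. 0.]
    print(sparsemax([0.5, 0.3, 0.1]))    # -> [0.533... 0.333... 0.133...]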

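The continuous counterpart replaces the categorical attention distribution with a density over positions: for alpha = 1 the density is Gaussian, and for alpha = 2 it is a truncated paraboloid with compact support. The sketch below approximates the alpha = 1 case in 1D by quadrature on a uniform grid; the function name `continuous_softmax_attention`, the grid size, and the linear interpolation of value vectors are illustrative choices, not the closed-form expressions derived in the paper.

    import numpy as np

    def continuous_softmax_attention(values, positions, mu, sigma2, grid_size=1000):
        """Discretized 1D continuous attention for alpha = 1 (Gaussian density).

        values    : (L, d) value vectors observed at `positions` in [0, 1]
        positions : (L,) normalized token positions
        mu, sigma2: mean and variance of the attention density over [0, 1]
        Returns an approximation of the context vector E_{t ~ N(mu, sigma2)}[V(t)].
        """
        t = np.linspace(0.0, 1.0, grid_size)               # quadrature grid
        density = np.exp(-0.5 * (t - mu) ** 2 / sigma2)
        density /= density.sum()                           # normalize the discretized density
        # linearly interpolate each value dimension onto the grid, then average
        V = np.stack([np.interp(t, positions, values[:, d])
                      for d in range(values.shape[1])], axis=1)   # (grid_size, d)
        return density @ V

    # toy usage: five 4-dimensional values, attention centered near the end
    rng = np.random.default_rng(0)
    vals = rng.normal(size=(5, 4))
    pos = np.linspace(0.0, 1.0, 5)
    print(continuous_softmax_attention(vals, pos, mu=0.8, sigma2=0.01))

For alpha = 2 (continuous sparsemax), an analogous grid-based approximation would apply the sparsemax projection above to the quadratic scores -(t - mu)**2 / (2 * sigma2) instead of exponentiating them, giving weights that are exactly zero outside a bounded interval, i.e., attention over a compact region.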