Latent Alignment and Variational Attention

Neural attention has become central to many state-of-the-art models in natural language processing and related domains. Attention networks are an easy-to-train and effective method for softly simulating alignment; however, the approach does not marginalize over latent alignments in a probabilistic sense. This property makes it difficult to compare attention to other alignment approaches, to compose it with probabilistic models, and to perform posterior inference conditioned on observed data. A related latent approach, hard attention, fixes these issues but is generally harder to train and less accurate. This work considers variational attention networks, alternatives to soft and hard attention for learning latent-variable alignment models, with tighter approximation bounds based on amortized variational inference. We further propose methods for reducing the variance of gradients to make these approaches computationally feasible. Experiments show that for machine translation and visual question answering, inefficient exact latent-variable models outperform standard neural attention, but these gains go away with hard-attention-based training. In contrast, variational attention retains most of the performance gain while training at a speed comparable to neural attention.
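To make the distinction concrete, the sketch below contrasts, on a single toy attention step, the three objectives the abstract refers to: soft attention (the expectation over alignments is taken inside the likelihood), exact marginalization over the latent alignment, and a single-sample variational lower bound with a score-function (REINFORCE) surrogate. This is a minimal illustration assuming PyTorch; the toy dimensions, the dot-product alignment scores, and the stand-in inference network are assumptions for exposition, not the paper's actual architecture.

```python
import torch
from torch.distributions import Categorical, kl_divergence

torch.manual_seed(0)
T, H, V = 5, 8, 11            # source length, hidden size, vocab size (toy values)
memory = torch.randn(T, H)     # encoder states x_1 ... x_T
query = torch.randn(H)         # decoder state at the current output step
out = torch.nn.Linear(H, V)    # maps a context vector to output-word logits
target = torch.tensor(3)       # the observed output word y at this step

# Alignment distribution p(z | x, query); here a simple dot-product softmax.
scores = memory @ query                       # (T,)
p_align = torch.softmax(scores, dim=-1)

# 1) Soft attention: the expectation is taken *inside* the likelihood,
#    feeding the expected context E_p[x_z] to the predictor.
soft_ctx = p_align @ memory                   # (H,)
soft_ll = Categorical(logits=out(soft_ctx)).log_prob(target)

# 2) Exact latent alignment: marginalize over z,
#    log p(y | x) = log sum_z p(z | x) p(y | x, z); exact, but needs T decoder calls.
word_ll = Categorical(logits=out(memory)).log_prob(target)   # log p(y | x, z) per z
exact_ll = torch.logsumexp(torch.log(p_align) + word_ll, dim=-1)

# 3) Variational attention: a single-sample evidence lower bound (ELBO) with an
#    amortized approximate posterior q(z | x, y). The q logits below are a
#    stand-in for an inference network that also conditions on y (an assumption
#    of this sketch).
q = Categorical(logits=scores + 0.5 * torch.randn(T))
z = q.sample()
elbo = word_ll[z] - kl_divergence(q, Categorical(probs=p_align))
# Score-function (REINFORCE) surrogate: in a real model, backpropagating through
# it gives an unbiased, though high-variance, estimate of the ELBO gradient.
surrogate = elbo + q.log_prob(z) * word_ll[z].detach()

print(f"soft attention log-likelihood : {soft_ll.item():.3f}")
print(f"exact marginal log-likelihood : {exact_ll.item():.3f}")
print(f"single-sample ELBO            : {elbo.item():.3f}")
```

The plain score-function term in step 3 is exactly where the gradient-variance problem mentioned in the abstract arises, and it is the place where the paper's variance-reduction methods would apply.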
