Structured Attention Networks

Attention networks have proven to be an effective approach for embedding categorical inference within a deep neural network. However, for many tasks we may want to model richer structural dependencies without abandoning end-to-end training. In this work, we incorporate richer structural distributions, encoded using graphical models, within deep networks. We show that these structured attention networks are simple extensions of the basic attention procedure, and that they allow attention to go beyond the standard soft-selection approach, for example by attending to partial segmentations or to subtrees. We experiment with two classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference. We further find that models trained in this way learn interesting unsupervised hidden representations that generalize simple attention.
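
As a concrete illustration of the linear-chain case, the sketch below shows one way segmentation attention can be computed: a binary selection variable z_i over each memory position forms a linear-chain CRF, and the exact marginals p(z_i = 1 | x, q) from the forward-backward algorithm replace the usual softmax weights when forming the attention context. This is a minimal NumPy sketch; the function name, toy potentials, and dimensions are illustrative assumptions rather than the paper's actual code.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def crf_attention_marginals(unary, pairwise):
    """Segmentation attention via a linear-chain CRF (hypothetical sketch).

    unary    : (n, 2) log-potentials for each position's selection variable
               z_i in {0, 1} (e.g. scores from a query/key network).
    pairwise : (2, 2) log-potentials over adjacent pairs (z_{i-1}, z_i),
               encoding a preference for contiguous selections.

    Returns p(z_i = 1 | x, q) for every position, computed exactly with the
    forward-backward algorithm; these marginals replace softmax weights.
    """
    n = unary.shape[0]
    # Forward pass: alpha[i, s] sums over all prefixes ending with z_i = s.
    alpha = np.empty((n, 2))
    alpha[0] = unary[0]
    for i in range(1, n):
        alpha[i] = unary[i] + logsumexp(alpha[i - 1][:, None] + pairwise, axis=0)
    # Backward pass: beta[i, s] sums over all suffixes given z_i = s.
    beta = np.zeros((n, 2))
    for i in range(n - 2, -1, -1):
        beta[i] = logsumexp(pairwise + (unary[i + 1] + beta[i + 1])[None, :], axis=1)
    log_Z = logsumexp(alpha[-1], axis=0)  # log partition function
    return np.exp(alpha[:, 1] + beta[:, 1] - log_Z)

# Toy usage: 5 memory positions, mild preference for keeping the same state.
rng = np.random.default_rng(0)
unary = rng.normal(size=(5, 2))
pairwise = np.log(np.array([[0.6, 0.4],
                            [0.4, 0.6]]))
weights = crf_attention_marginals(unary, pairwise)  # soft segmentation weights
context = weights @ rng.normal(size=(5, 8))         # attention context vector
print(weights)
```

Because forward-backward is itself a sequence of differentiable operations, the same computation can be expressed with autodiff tensors (e.g. in PyTorch) and trained end-to-end with the rest of the network, which is what allows the structured layer to be used as a drop-in replacement for softmax attention.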
