Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling

Many natural language processing tasks rely solely on sparse dependencies between a few tokens in a sentence. Soft attention mechanisms show promising performance in modeling local and global dependencies by computing soft probabilities between every pair of tokens, but they are neither effective nor efficient when applied to long sentences. By contrast, hard attention mechanisms directly select a subset of tokens, but their combinatorial nature makes them difficult and inefficient to train. In this paper, we integrate both soft and hard attention into a single context-fusion model, "reinforced self-attention (ReSA)", so that the two benefit each other. In ReSA, a hard attention module trims a sequence for a soft self-attention to process, while the soft attention feeds reward signals back to facilitate the training of the hard one. For this purpose, we develop a novel hard attention mechanism, "reinforced sequence sampling (RSS)", which selects tokens in parallel and is trained via policy gradient. With two RSS modules, ReSA efficiently extracts the sparse dependencies between each pair of selected tokens. Finally, we propose an RNN/CNN-free sentence-encoding model, the "reinforced self-attention network (ReSAN)", based solely on ReSA. It achieves state-of-the-art performance on both the Stanford Natural Language Inference (SNLI) and Sentences Involving Compositional Knowledge (SICK) datasets.
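
To make the described architecture concrete, here is a minimal sketch (not the authors' implementation) of how a hard RSS module and a soft self-attention could be combined, assuming RSS is modeled as a per-token Bernoulli selection policy trained with REINFORCE and the soft attention is an ordinary scaled dot-product self-attention restricted to the selected tokens. The class names follow the abstract; all layer choices, shapes, and hyperparameters here are illustrative.

```python
# Illustrative PyTorch sketch of the ReSA idea, under the assumptions stated above.
import torch
import torch.nn as nn

class RSS(nn.Module):
    """Reinforced sequence sampling: selects tokens in parallel via a Bernoulli policy."""
    def __init__(self, d_model):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        logits = self.scorer(x).squeeze(-1)    # per-token selection logits
        probs = torch.sigmoid(logits)
        mask = torch.bernoulli(probs)          # hard, parallel token selection
        log_prob = (mask * torch.log(probs + 1e-8)
                    + (1 - mask) * torch.log(1 - probs + 1e-8)).sum(-1)
        return mask, log_prob                  # log_prob is used for policy-gradient updates

class ReSA(nn.Module):
    """Soft self-attention applied only between token pairs kept by two RSS modules."""
    def __init__(self, d_model):
        super().__init__()
        self.rss_q = RSS(d_model)              # selects "query" tokens
        self.rss_k = RSS(d_model)              # selects "key/value" tokens
        self.qkv = nn.Linear(d_model, 3 * d_model)

    def forward(self, x):
        mask_q, logp_q = self.rss_q(x)
        mask_k, logp_k = self.rss_k(x)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        # Attend only between selected (query, key) pairs; all other pairs are masked out.
        pair_mask = mask_q.unsqueeze(-1) * mask_k.unsqueeze(-2)
        scores = scores.masked_fill(pair_mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1).nan_to_num(0.0)   # rows with no selected keys -> 0
        return attn @ v, logp_q + logp_k       # context features + log-probs for REINFORCE

# Example usage:
# x = torch.randn(2, 10, 64)        # (batch, seq_len, d_model)
# context, log_prob = ReSA(64)(x)   # context: (2, 10, 64)
```

In this sketch, the "reward signals fed back by the soft attention" would enter training as a task-level reward (e.g., the downstream classifier's accuracy or negative loss) combined with the returned log-probabilities through the standard REINFORCE estimator, -(reward - baseline) * log_prob, which corresponds to the policy-gradient training mentioned in the abstract.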
