Assessing the Ability of Self-Attention Networks to Learn Word Order

Self-attention networks (SANs) have attracted a lot of interest due to their high parallelization and strong performance on a variety of NLP tasks, e.g., machine translation. Because they lack the recurrence structure of recurrent neural networks (RNNs), SANs are assumed to be weak at learning the positional information of words for sequence modeling. However, this speculation has neither been empirically confirmed, nor has it been explained why SANs perform so strongly on machine translation while supposedly "lacking positional information". To this end, we propose a novel word reordering detection task to quantify how well SAN and RNN learn word order information. Specifically, we randomly move one word to another position and examine whether a trained model can detect both the original and the inserted positions. Experimental results reveal that: 1) SAN trained on word reordering detection indeed has difficulty learning positional information, even with position embeddings; and 2) SAN trained on machine translation learns better positional information than its RNN counterpart, with position embeddings playing a critical role. Although the recurrence structure makes a model more universally effective at learning word order, the learning objective matters more in downstream tasks such as machine translation.
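
As a concrete illustration of the detection task described above, the following is a minimal sketch of how a single word reordering detection (WRD) instance could be constructed: one token is moved to a different position, and the original and inserted positions are kept as targets. The function name, return format, and seeding are illustrative assumptions, not the paper's actual data pipeline.

```python
import random


def make_wrd_instance(tokens, rng=random):
    """Build one word reordering detection (WRD) example.

    A randomly chosen token is removed from its original position and
    re-inserted at a different position. The model's task is to detect
    both positions. The exact label encoding in the paper may differ;
    here we simply return the two indices.
    """
    assert len(tokens) > 1, "need at least two tokens to move one"
    orig_pos = rng.randrange(len(tokens))              # where the word came from
    word = tokens[orig_pos]
    remainder = tokens[:orig_pos] + tokens[orig_pos + 1:]
    # Choose an insertion slot that actually changes the word's position
    # (re-inserting at orig_pos would reconstruct the original sentence).
    slots = [i for i in range(len(remainder) + 1) if i != orig_pos]
    new_pos = rng.choice(slots)                        # index of the word in the output
    reordered = remainder[:new_pos] + [word] + remainder[new_pos:]
    return reordered, orig_pos, new_pos


if __name__ == "__main__":
    sent = "the quick brown fox jumps over the lazy dog".split()
    shuffled, original_pos, inserted_pos = make_wrd_instance(sent, random.Random(0))
    print(" ".join(shuffled), "| original:", original_pos, "| inserted:", inserted_pos)
```

A trained detector (SAN- or RNN-based) would then be evaluated on how often it recovers both indices from the perturbed sentence alone.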
