Self-Attention Networks Can Process Bounded Hierarchical Languages

Despite their impressive performance in NLP, self-attention networks were recently proved to be limited for processing formal languages with hierarchical structure, such as Dyck_k, the language consisting of well-nested parentheses of k types. This suggested that natural language can be approximated well with models that are too weak for formal languages, or that the role of hierarchy and recursion in natural language might be limited. We qualify this implication by proving that self-attention networks can process Dyck_{k,D}, the subset of Dyck_k with depth bounded by D, which arguably better captures the bounded hierarchical structure of natural language. Specifically, we construct a hard-attention network with D + 1 layers and O(log k) memory size (per token per layer) that recognizes Dyck_{k,D}, and a soft-attention network with two layers and O(log k) memory size that generates Dyck_{k,D}. Experiments show that self-attention networks trained on Dyck_{k,D} generalize to longer inputs with near-perfect accuracy, and also verify the theoretical memory advantage of self-attention networks over recurrent networks.
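
For concreteness, the sketch below (not the paper's construction) makes the target language explicit: a stack-based recognizer and a random sampler for Dyck_{k,D}, the language of well-nested brackets of k types whose nesting depth never exceeds D. All function and parameter names (is_dyck_k_D, sample_dyck_k_D, p_open, max_len) are illustrative choices introduced here, not taken from the paper.

```python
import random

def open_bracket(i):
    # Hypothetical token encoding: bracket type i opens as "(i" and closes as ")i".
    return f"({i}"

def close_bracket(i):
    return f"){i}"

def is_dyck_k_D(tokens, k, D):
    """Recognize Dyck_{k,D} with the classical explicit-stack algorithm."""
    stack = []
    for tok in tokens:
        if tok.startswith("("):
            stack.append(tok[1:])            # push the bracket type
            if len(stack) > D:               # depth bound violated
                return False
        else:
            if not stack or stack[-1] != tok[1:]:
                return False                 # unmatched or mismatched close bracket
            stack.pop()
    return not stack                         # every open bracket must be closed

def sample_dyck_k_D(k, D, max_len=20, p_open=0.5):
    """Sample a random Dyck_{k,D} string token by token, respecting the depth bound D."""
    assert D >= 1
    tokens, stack = [], []
    while len(tokens) < max_len or stack:
        can_open = len(stack) < D and len(tokens) < max_len
        if can_open and (not stack or random.random() < p_open):
            i = random.randrange(k)
            stack.append(str(i))
            tokens.append(open_bracket(i))
        else:
            tokens.append(close_bracket(stack.pop()))
    return tokens

if __name__ == "__main__":
    s = sample_dyck_k_D(k=3, D=2)
    print(" ".join(s), "->", is_dyck_k_D(s, k=3, D=2))  # expected: True
```

Such a sampler is one simple way to produce bounded-depth training and evaluation data of varying lengths, which is the kind of setup the length-generalization experiments described above require.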
