Self-Attention Networks Can Process Bounded Hierarchical Languages

Despite their impressive performance in NLP, self-attention networks were recently proved to be limited for processing formal languages with hierarchical structure, such as Dyck-k, the language consisting of well-nested parentheses of k types. This suggested that natural language can be approximated well with models that are too weak for formal languages, or that the role of hierarchy and recursion in natural language might be limited. We qualify this implication by proving that self-attention networks can process Dyck-(k, D), the subset of Dyck-k with depth bounded by D, which arguably better captures the bounded hierarchical structure of natural language. Specifically, we construct a hard-attention network with D+1 layers and O(log k) memory size (per token per layer) that recognizes Dyck-(k, D), and a soft-attention network with two layers and O(log k) memory size that generates Dyck-(k, D). Experiments show that self-attention networks trained on Dyck-(k, D) generalize to longer inputs with near-perfect accuracy, and also verify the theoretical memory advantage of self-attention networks over recurrent networks.
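To make the language in question concrete, here is a minimal sketch (not the paper's attention construction) of a stack-based membership checker for Dyck-(k, D): the well-nested bracket language with k bracket types whose nesting depth never exceeds D. The function name, token encoding, and examples below are illustrative choices, not notation from the paper.

```python
def is_dyck_kd(tokens, k, D):
    """Return True iff `tokens` is a well-nested bracket string of k types
    whose nesting depth never exceeds D.

    Each token is a pair (bracket_type, is_open) with bracket_type in
    range(k) and is_open a bool marking an opening bracket.
    """
    stack = []
    for bracket_type, is_open in tokens:
        if not 0 <= bracket_type < k:
            return False                # unknown bracket type
        if is_open:
            stack.append(bracket_type)
            if len(stack) > D:          # depth bound D violated
                return False
        else:
            if not stack or stack[-1] != bracket_type:
                return False            # mismatched or unopened bracket
            stack.pop()
    return not stack                    # every opened bracket must be closed


# "([])" with k = 2 is in Dyck-(2, 2); "(())" exceeds the depth bound D = 1.
print(is_dyck_kd([(0, True), (1, True), (1, False), (0, False)], k=2, D=2))  # True
print(is_dyck_kd([(0, True), (0, True), (0, False), (0, False)], k=2, D=1))  # False
```

The paper's point is that this bounded-depth variant, unlike unbounded Dyck-k, is within reach of self-attention: the constructions in the abstract replace the explicit stack above with D+1 hard-attention layers (for recognition) or two soft-attention layers (for generation), each using only O(log k) memory per token per layer.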
