Trees in transformers: a theoretical analysis of the Transformer's ability to represent trees

Transformer networks (Vaswani et al., 2017) are the de facto standard architecture in natural language processing. To date, there are no theoretical analyses of the Transformer’s ability to capture tree structures. We focus on the ability of Transformer networks to learn tree structures that are important for tree transduction problems. We first analyze the theoretical capability of the standard Transformer architecture to learn tree structures given an enumeration of all possible tree backbones, which we define as trees without labels. We then prove that two linear layers with a ReLU activation can recover any tree backbone from any two nonzero, linearly independent starting backbones. This implies that, in theory, a Transformer can learn tree structures well. In experiments on synthetic data, the standard Transformer achieves accuracy comparable to that of a Transformer in which tree position information is explicitly encoded, albeit with slower convergence. This confirms empirically that Transformers can learn tree structures.
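
To give a concrete sense of the two-layer ReLU claim, the sketch below works through its linear-algebra core numerically, assuming tree backbones are encoded as real-valued vectors (the example vectors, and the encoding itself, are illustrative assumptions rather than the paper's actual scheme): any two nonzero, linearly independent backbone encodings admit a dual basis on their span, so the first layer can map them to distinct one-hot activations and the second layer can then emit any target backbone.

```python
import numpy as np

# Hypothetical vector encodings of two linearly independent "starting" tree
# backbones b1, b2 and an arbitrary target backbone t (the encoding is an
# assumption for illustration; the paper's construction may differ).
b1 = np.array([1.0, 0.0, 1.0, 0.0])
b2 = np.array([0.0, 1.0, 1.0, 1.0])
t  = np.array([1.0, 1.0, 0.0, 1.0])   # backbone we want to recover

# Because b1 and b2 are linearly independent, a dual basis (w1, w2) with
# w_i . b_j = delta_ij exists on their span; the pseudoinverse gives it.
B = np.stack([b1, b2])                 # 2 x d
W1 = np.linalg.pinv(B.T)               # 2 x d, rows are the dual basis

relu = lambda x: np.maximum(x, 0.0)

# First layer: ReLU(W1 x) sends b1 -> (1, 0) and b2 -> (0, 1).
h1, h2 = relu(W1 @ b1), relu(W1 @ b2)

# Second layer: each column is the backbone the corresponding one-hot
# activation should produce; here both starting backbones map to t.
W2 = np.stack([t, t], axis=1)          # d x 2

print(np.allclose(W2 @ h1, t))         # True
print(np.allclose(W2 @ h2, t))         # True
```

Both starting backbones are sent to the same target here; putting different vectors in the two columns of W2 would instead send each starting backbone to a different target backbone.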

[1] P. Barceló, et al. Attention is Turing-Complete, 2021, J. Mach. Learn. Res.

[2] Myle Ott, et al. Understanding Back-Translation at Scale, 2018, EMNLP.

[3] Akihiro Tamura, et al. Dependency-Based Relative Positional Encoding for Transformer NMT, 2019, RANLP.

[4] Robert Frank, et al. Open Sesame: Getting inside BERT’s Linguistic Knowledge, 2019, BlackboxNLP@ACL.

[5] Benoît Sagot, et al. What Does BERT Learn about the Structure of Language?, 2019, ACL.

[6] Liang Lu, et al. Top-down Tree Long Short-Term Memory Networks, 2015, NAACL.

[7] Peter Chin, et al. Tree-Transformer: A Transformer-Based Method for Correction of Tree-Structured Data, 2019, ArXiv.

[8] Hai Zhao, et al. Syntax-aware Transformer Encoder for Neural Machine Translation, 2019, International Conference on Asian Language Processing (IALP).

[9] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[10] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[11] Xing Wang, et al. Self-Attention with Structural Position Representations, 2019, EMNLP.

[12] Alexander M. Rush, et al. OpenNMT: Open-Source Toolkit for Neural Machine Translation, 2017, ACL.

[13] Chris Quirk, et al. Novel positional encodings to enable tree-structured transformers, 2018.

[14] Hava T. Siegelmann, et al. On the Computational Power of Neural Nets, 1995, J. Comput. Syst. Sci.

[15] Brooke Cowan, et al. A tree-to-tree model for statistical machine translation, 2008.

[16] Li Yang, et al. Big Bird: Transformers for Longer Sequences, 2020, NeurIPS.

[17] Hung-Yi Lee, et al. Tree Transformer: Integrating Tree Structures into Self-Attention, 2019, EMNLP/IJCNLP.

[18] Kyunghyun Cho, et al. Generating Diverse Translations with Sentence Codes, 2019, ACL.

[19] Beatrice Santorini, Anthony Kroch. The syntax of natural language: An online introduction using the Trees program, 2007.

[20] Rudolf Rosa, et al. Extracting Syntactic Trees from Transformer Encoder Self-Attentions, 2018, BlackboxNLP@EMNLP.

[21] Geoffrey E. Hinton, et al. Layer Normalization, 2016, ArXiv.

[22] Thomas Berg, et al. Structure in Language: A Dynamic Perspective, 2008.

[23] Ankit Singh Rawat, et al. Are Transformers universal approximators of sequence-to-sequence functions?, 2020, ICLR.

[24] Majid Razmara, et al. Application of Tree Transducers in Statistical Machine Translation, 2011.

[25] Lucia Specia, et al. Text Simplification as Tree Transduction, 2013, STIL.

[26] Roy Schwartz, et al. Provable Limitations of Acquiring Meaning from Ungrounded Form: What Will Future Language Models Understand?, 2021, Transactions of the Association for Computational Linguistics.

[27] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[28] Christopher D. Manning, et al. A Structural Probe for Finding Syntax in Word Representations, 2019, NAACL.