BERT is Not an Interlingua and the Bias of Tokenization

Multilingual transfer learning can benefit both high- and low-resource languages, but the source of these improvements is not well understood. Canonical Correlation Analysis (CCA) of the internal representations of a pre-trained, multilingual BERT model reveals that the model partitions representations for each language rather than using a common, shared, interlingual space. This effect is magnified at deeper layers, suggesting that the model does not progressively abstract semantic content while disregarding languages. Hierarchical clustering based on the CCA similarity scores between languages reveals a tree structure that mirrors the phylogenetic trees hand-designed by linguists. The subword tokenization employed by BERT provides a stronger bias towards such structure than character- and word-level tokenizations. We release a subset of the XNLI dataset translated into an additional 14 languages at https://www.github.com/salesforce/xnli_extension to assist further research into multilingual representations.
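The analysis described above can be sketched in a few lines. The snippet below is a minimal, illustrative pipeline, not the authors' released code: it assumes per-language activation matrices (one row per parallel sentence) have already been extracted from a given BERT layer, scores each language pair with the mean canonical correlation (a simpler stand-in for the (SV)CCA variants the paper builds on), and builds an average-linkage (UPGMA) tree over the languages. The function names `cca_similarity` and `cluster_languages` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def cca_similarity(X, Y, eps=1e-10):
    """Mean canonical correlation between two activation matrices.

    X, Y: (n_examples, dim) hidden states for the same sentences
    in two languages, taken from one layer of multilingual BERT.
    """
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Whiten each view via its SVD, dropping near-zero directions.
    Ux, Sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Y, full_matrices=False)
    Ux, Uy = Ux[:, Sx > eps], Uy[:, Sy > eps]
    # Singular values of Ux^T Uy are the canonical correlations.
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return rho.mean()

def cluster_languages(acts_by_lang):
    """acts_by_lang: dict of language code -> (n, dim) activations
    for parallel sentences. Returns a UPGMA dendrogram over languages."""
    langs = sorted(acts_by_lang)
    n = len(langs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim = cca_similarity(acts_by_lang[langs[i]],
                                 acts_by_lang[langs[j]])
            # Turn the CCA similarity into a distance.
            dist[i, j] = dist[j, i] = 1.0 - sim
    # Average-linkage clustering over the pairwise distances.
    Z = linkage(squareform(dist), method='average')
    return dendrogram(Z, labels=langs, no_plot=True)
```

Running `cluster_languages` on activations from a deep layer should, per the abstract, yield a leaf ordering (the `'ivl'` field of the returned dict) that groups languages roughly along phylogenetic lines.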
