On the differences between BERT and MT encoder spaces and how to address them in translation tasks

Various studies have shown that pretrained language models such as BERT cannot straightforwardly replace encoders in neural machine translation, despite their enormous success on other tasks. This is all the more surprising given the similarities between the two architectures. This paper sheds light on the embedding spaces they create, comparing them with average cosine similarity, contextuality metrics, and measures of representational similarity, and reveals that BERT and NMT encoder representations differ significantly from one another. To address this issue, we propose a supervised transformation from one space into the other, using explicit alignment and fine-tuning. Our results demonstrate the need for such a transformation to improve the applicability of BERT in MT.
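The comparison and alignment steps named above can be pictured with a short, purely illustrative sketch. This is not the paper's code: the inputs `bert_reps` and `nmt_reps` are hypothetical arrays holding representations of the same tokens extracted from BERT and from an NMT encoder, the average-cosine measure is the standard anisotropy-style statistic, and a least-squares linear map stands in as one simple form of explicit alignment.

```python
# Minimal sketch (not the paper's implementation): comparing two encoder
# spaces and fitting a linear map between them.
# Assumes bert_reps and nmt_reps are NumPy arrays of shape (n_tokens, d)
# containing representations of the same tokens (hypothetical inputs).
import numpy as np

def avg_cosine_similarity(reps, n_pairs=10_000, seed=0):
    """Average cosine similarity between randomly sampled pairs of vectors.

    A high value indicates an anisotropic space whose vectors occupy a
    narrow cone; comparing this statistic across spaces is one of the
    contextuality-style measures mentioned in the abstract.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(reps), n_pairs)
    j = rng.integers(0, len(reps), n_pairs)
    a = reps[i] / np.linalg.norm(reps[i], axis=1, keepdims=True)
    b = reps[j] / np.linalg.norm(reps[j], axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

def fit_linear_map(src, tgt):
    """Least-squares linear transformation W such that src @ W ~= tgt."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W

# Hypothetical usage, given pre-extracted token representations:
# print(avg_cosine_similarity(bert_reps), avg_cosine_similarity(nmt_reps))
# W = fit_linear_map(bert_reps, nmt_reps)      # explicit alignment step
# transformed = bert_reps @ W                  # BERT space mapped toward NMT space
```

In practice such a learned transformation would be applied before (or jointly with) fine-tuning the translation model; the sketch only illustrates the kind of supervised mapping the abstract refers to.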
