Contrastive Conditioning for Assessing Disambiguation in MT: A Case Study of Distilled Bias

Lexical disambiguation is a major challenge for machine translation systems, especially if some senses of a word are seen less often in training than others. Identifying patterns of overgeneralization requires evaluation methods that are both reliable and scalable. We propose contrastive conditioning as a reference-free, black-box method for detecting disambiguation errors. Specifically, we score the quality of a translation by conditioning on variants of the source that provide contrastive disambiguation cues. After validating our method, we apply it in a case study to perform a targeted evaluation of sequence-level knowledge distillation. By probing word sense disambiguation and the translation of gendered occupation names, we show that distillation-trained models tend to overgeneralize more than other models with a comparable BLEU score. Contrastive conditioning thus highlights a side effect of distillation that is not fully captured by standard evaluation metrics. Code and data to reproduce our findings are publicly available.
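The scoring idea can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions. In practice the scores would come from a translation model used as an evaluator. Here, `toy_score` and `TOY_LEXICON` are hypothetical stand-ins that return a length-normalized log-probability computed from a hand-written lexicon. The parenthesized cues such as "(animal)" are illustrative and do not reproduce the paper's cue format. The decision rule (a translation counts as correct if it is more probable given the cue for the intended sense) is likewise the sketch's own assumption.

```python
import math
from typing import Callable, Dict, Set, Tuple

# Hypothetical toy lexicon (assumption for illustration only): a cue word in
# the source makes certain target words likely.
TOY_LEXICON: Dict[str, Set[str]] = {
    "animal": {"Fledermaus"},
    "baseball": {"Schläger"},
}


def toy_score(source: str, translation: str) -> float:
    """Hypothetical stand-in for an NMT evaluator model.

    Returns the average per-token log-probability of `translation` given
    `source`. A target token licensed by a cue word in the source gets
    probability 0.9, and every other token gets 0.1.
    """
    src_words = {w.strip(".,()").lower() for w in source.split()}
    licensed: Set[str] = set()
    for w in src_words:
        licensed |= TOY_LEXICON.get(w, set())
    tokens = [t.strip(".,") for t in translation.split()]
    if not tokens:
        raise ValueError("empty translation")
    logps = [math.log(0.9 if t in licensed else 0.1) for t in tokens]
    return sum(logps) / len(logps)  # length-normalized log-probability


def contrastive_conditioning(
    translation: str,
    correct_variant: str,
    incorrect_variant: str,
    score_fn: Callable[[str, str], float] = toy_score,
) -> Tuple[bool, float]:
    """Score one translation under two contrastive source variants.

    `correct_variant` carries a cue for the intended sense (or gender) and
    `incorrect_variant` a cue for the competing one. The translation counts
    as correct if it scores higher given the correct cue. Ties count as not
    correct in this sketch. Also returns the score margin.
    """
    s_pos = score_fn(correct_variant, translation)
    s_neg = score_fn(incorrect_variant, translation)
    return s_pos > s_neg, s_pos - s_neg


if __name__ == "__main__":
    pos = "He saw a bat (animal)."
    neg = "He saw a bat (baseball)."
    for hyp in ["Er sah eine Fledermaus.", "Er sah einen Schläger."]:
        ok, margin = contrastive_conditioning(hyp, pos, neg)
        print(f"{hyp!r}: correct={ok}, margin={margin:.3f}")
```

Because the method only needs scores from the evaluator, the translation system under test stays a black box, and no reference translation is required. Swapping `toy_score` for a real model's scoring function is the only change the sketch would need.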
