Translation Error Detection as Rationale Extraction

Recent Quality Estimation (QE) models based on multilingual pre-trained representations have achieved very competitive results when predicting the overall quality of translated sentences. Predicting translation errors, i.e. detecting specifically which words are incorrect, is a more challenging task, especially with limited amounts of training data. We hypothesize that, not unlike humans, successful QE models rely on translation errors to predict overall sentence quality. By exploring a set of feature attribution methods that assign relevance scores to the inputs to explain model predictions, we study the behaviour of state-of-the-art sentence-level QE models and show that explanations (i.e. rationales) extracted from these models can indeed be used to detect translation errors. We therefore (i) introduce a novel semi-supervised method for word-level QE and (ii) propose to use the QE task as a new benchmark for evaluating the plausibility of feature attribution, i.e. how interpretable model explanations are to humans.
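
To illustrate the idea of extracting rationales from a sentence-level QE model, the following is a minimal sketch of one feature attribution method (gradient x input) applied to a regression head over a multilingual encoder. The checkpoint name, the regression setup, and the thresholding of token relevance into error labels are assumptions for illustration, not the paper's exact configuration.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed setup: a multilingual encoder with a single-output regression head
# standing in for a sentence-level QE model.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=1)
model.eval()

def token_relevance(source: str, translation: str):
    """Return (tokens, relevance), where relevance approximates each token's
    contribution to the predicted sentence-level quality score."""
    enc = tokenizer(source, translation, return_tensors="pt")
    # Detach the input embeddings so gradients are accumulated on them.
    embeddings = model.get_input_embeddings()(enc["input_ids"]).detach()
    embeddings.requires_grad_(True)
    out = model(inputs_embeds=embeddings,
                attention_mask=enc["attention_mask"])
    score = out.logits.squeeze()  # predicted sentence-level quality
    score.backward()
    # Gradient x input, summed over the embedding dimension, gives one
    # relevance score per input token.
    relevance = (embeddings.grad * embeddings).sum(-1).squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return tokens, relevance

# Target-side tokens with the lowest relevance to the quality prediction can
# then be flagged as candidate translation errors, e.g. by thresholding or
# taking the bottom-k scores.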
