Automatic Testing and Improvement of Machine Translation

This paper presents TransRepair, a fully automatic approach for testing and repairing the consistency of machine translation systems. TransRepair combines mutation with metamorphic testing to detect inconsistency bugs without access to human oracles. It then post-processes the translations using probability-reference or cross-reference, in a grey-box or black-box manner, to repair the inconsistencies. Our evaluation on two state-of-the-art translators, Google Translate and Transformer, indicates that TransRepair has high precision (99%) in generating input pairs with consistent translations. With these tests, using automatic consistency metrics and manual assessment, we find that Google Translate and Transformer have approximately 36% and 40% inconsistency bugs, respectively. Black-box repair fixes 28% and 19% of the bugs on average for Google Translate and Transformer, respectively. Grey-box repair fixes 30% of the bugs on average for Transformer. Manual inspection indicates that the translations repaired by our approach improve consistency in 87% of cases (degrading it in 2%), and that our repairs have better translation acceptability in 27% of cases (worse in 8%).
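The testing idea described above (mutate an input, translate both versions, and flag the pair when the two translations diverge too much) can be illustrated with a minimal sketch. This is not TransRepair's implementation: the helper names `mutate`, `consistency`, and `is_inconsistent`, the whitespace tokenization, the 0.7 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration, and the translator is passed in as an arbitrary callable.

```python
from difflib import SequenceMatcher
from typing import Callable


def mutate(sentence: str, old: str, new: str) -> str:
    """Replace the first whole-word occurrence of `old` with `new` (hypothetical helper)."""
    tokens = sentence.split()
    idx = tokens.index(old)  # raises ValueError if `old` is not a token
    return " ".join(tokens[:idx] + [new] + tokens[idx + 1:])


def consistency(translation_a: str, translation_b: str) -> float:
    """Token-level similarity in [0, 1]; a stand-in metric, not the paper's."""
    return SequenceMatcher(None, translation_a.split(), translation_b.split()).ratio()


def is_inconsistent(
    translate: Callable[[str], str],
    sentence: str,
    old: str,
    new: str,
    threshold: float = 0.7,  # assumed cut-off, chosen only for this sketch
) -> bool:
    """Metamorphic check: a one-word change in the input should leave
    most of the translation unchanged."""
    mutant = mutate(sentence, old, new)
    return consistency(translate(sentence), translate(mutant)) < threshold
```

With a toy "translator" such as `str.upper`, the pair "men are good at math" / "women are good at math" is judged consistent, because four of the five output tokens are shared. A translator that returns an unrelated sentence for the mutant would be flagged.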
