Testing Machine Translation via Referential Transparency

Machine translation software has seen rapid progress in recent years due to advances in deep neural networks. People routinely use machine translation software in their daily lives for tasks such as ordering food in a foreign restaurant, receiving medical diagnosis and treatment from foreign doctors, and reading international political news online. However, due to the complexity and intractability of the underlying neural networks, modern machine translation software is still far from robust and can produce poor or incorrect translations; this can lead to misunderstanding, financial loss, threats to personal safety and health, and political conflicts. To address this problem, we introduce referentially transparent inputs (RTIs), a simple, widely applicable methodology for validating machine translation software. A referentially transparent input is a piece of text that should have similar translations when used in different contexts. Our practical implementation, Purity, detects when this property is broken by a translation. To evaluate RTI, we use Purity to test Google Translate and Bing Microsoft Translator with 200 unlabeled sentences; it detects 123 and 142 erroneous translations, respectively, with high precision (79.3% and 78.3%). The translation errors are diverse, including examples of under-translation, over-translation, word/phrase mistranslation, incorrect modification, and unclear logic.
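The RTI property can be illustrated with a minimal sketch. This is not the Purity implementation: the abstract does not say how RTIs are extracted or how translations are compared, so the names `missing_tokens`, `find_suspicious_pairs`, and `translate`, the whitespace tokenization, and the threshold are all assumptions made for illustration. The sketch translates a phrase on its own and inside a containing sentence, then reports the pair when tokens of the standalone translation are missing from the sentence translation.

```python
from collections import Counter
from typing import Callable, List, Tuple


def missing_tokens(phrase_translation: str, context_translation: str) -> int:
    """Count tokens of the phrase translation absent from the context translation.

    Assumption: whitespace tokenization is adequate for the target language.
    """
    phrase_bag = Counter(phrase_translation.lower().split())
    context_bag = Counter(context_translation.lower().split())
    # Counter subtraction keeps only positive counts, i.e. the missing tokens.
    return sum((phrase_bag - context_bag).values())


def find_suspicious_pairs(
    pairs: List[Tuple[str, str]],
    translate: Callable[[str], str],
    threshold: int = 0,
) -> List[Tuple[str, str, int]]:
    """Report (phrase, sentence, distance) for pairs whose translations diverge.

    `pairs` holds (phrase, containing sentence) tuples; `translate` is a
    hypothetical stand-in for whatever translation system is under test.
    """
    reports = []
    for phrase, sentence in pairs:
        distance = missing_tokens(translate(phrase), translate(sentence))
        if distance > threshold:
            reports.append((phrase, sentence, distance))
    return reports
```

A reported pair is only a candidate error: a phrase can legitimately be translated differently in some contexts, which is consistent with the reported precision being below 100%.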
