Oracle-free Detection of Translation Issue for Neural Machine Translation

Neural Machine Translation (NMT) has been widely adopted in recent years because of its advantages on various translation tasks. However, NMT systems can be error-prone owing to the intractability of natural languages and the design of neural networks, which introduces issues into their translations. These issues can lead to information loss, wrong semantics, and low readability, compromising the usefulness of NMT and potentially causing non-trivial consequences. Although there are existing approaches to quality assessment and issue detection for NMT, such as using the BLEU score, they face two serious limitations. First, they require oracle translations, i.e., reference translations, which are often unavailable, e.g., in production environments. Second, they cannot pinpoint the types and locations of issues within translations. To address these limitations, we propose a new approach that aims to precisely detect issues in translations without requiring oracle translations. Our approach focuses on the two most prominent issues in NMT translations and includes two detection algorithms. Our experimental results show that the approach achieves high effectiveness on real-world datasets. Our successful experience deploying the proposed algorithms in both the development and production environments of WeChat, a messenger app with over one billion monthly active users, has helped eliminate numerous defects of our NMT model, monitor its effectiveness on real-world translation tasks, and collect in-house test cases, producing high industry impact.
