Domain-specific machine translation with recurrent neural network for software localization

Software localization is the process of adapting a software product to the linguistic, cultural and technical requirements of a target market. It allows software companies to access foreign markets that would otherwise be difficult to penetrate. Many studies have addressed locating need-to-translate strings in software and adapting the UI layout after the text has been translated into the new language. However, no work has been done on the most important and time-consuming step of the software localization process, i.e., the translation of software text. Software text has some unique characteristics, for example, application-specific meanings, context-sensitive translation, and domain-specific rare words. Because of these, general machine translation tools such as Google Translate cannot properly address the linguistic and technical nuances of translating software text for software localization. In this paper, we propose a neural-network-based translation model specifically designed and trained for mobile application text translation. We collect large-scale human-translated bilingual sentence pairs from different Android applications, which are crawled from the Google Play store. We customize the original RNN encoder-decoder neural machine translation model by adding categorical information to address the domain-specific rare word problem, which is a common phenomenon in software text. We evaluate our approach on translating the text of held-out test Android applications using both BLEU score and exact match rate. The results show that our method outperforms the general machine translation tool, Google Translate, and generates more acceptable translations for software localization with less need for human revision. Our approach is language independent, and we show its generality by translating between English and the other five official languages of the United Nations (UN).
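The two evaluation metrics named above can be illustrated with a minimal sketch. It assumes whitespace tokenization, one reference per translation, and unsmoothed corpus-level BLEU up to 4-grams; the function names `corpus_bleu` and `exact_match_rate` and the sample strings are hypothetical and are not taken from the paper, whose exact evaluation setup may differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU, one reference per hypothesis (assumption)."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # occurs in the reference.
            matched[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty discourages translations shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


def exact_match_rate(hypotheses, references):
    """Fraction of translations identical to the reference after
    whitespace normalization."""
    pairs = list(zip(hypotheses, references))
    if not pairs:
        return 0.0
    hits = sum(" ".join(h.split()) == " ".join(r.split()) for h, r in pairs)
    return hits / len(pairs)


# Hypothetical sample data, for illustration only.
hyps = ["open the settings menu", "delete file"]
refs = ["open the settings menu", "remove file"]
bleu = corpus_bleu(hyps, refs)
exact = exact_match_rate(hyps, refs)
```

Exact match rate is the stricter of the two: a translation counts only if it could be used without any revision, whereas BLEU gives partial credit for overlapping n-grams.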
