Lexical statistical machine translation for language migration

Prior research has shown that source code exhibits naturalness: because it is written by humans, it is likely to be repetitive. The same research showed that an n-gram language model, trained on a large corpus of existing source code, is useful in predicting the next token in a source file. In this paper, we investigate how well statistical machine translation (SMT) models designed for natural languages can help in migrating source code from one programming language to another. We treat source code as a sequence of lexical tokens and apply a phrase-based SMT model to the lexemes of those tokens. Our empirical evaluation on migrating two Java projects to C# showed that lexical, phrase-based SMT can achieve high lexical translation accuracy (BLEU scores of 81.3-82.6%). Users would have to manually edit only 11.9-15.8% of the tokens in the resulting code to correct it. However, a high percentage of the translated methods (49.5-58.6%) are syntactically incorrect. Our results therefore call for a more program-oriented SMT model that better integrates the syntactic and semantic information of a program to support language migration.
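
To make the token-level view concrete, the following is a minimal sketch of how a translated method could be compared with a reference at the lexeme level. It is not the tooling used in this paper: the regular-expression lexer is a hypothetical simplification (it does not handle string literals, comments, or multi-character operators), and normalizing the edit distance by the number of tokens in the translated code is an assumption based on the description above.

```python
import re

# Hypothetical, simplified lexer: identifiers/keywords, numbers,
# and any other single non-whitespace character as its own token.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def lex(code):
    """Split source text into a flat sequence of lexemes."""
    return TOKEN_RE.findall(code)


def token_edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute x with y (free if equal)
            ))
        prev = cur
    return prev[-1]


def edit_ratio(translated, reference):
    """Fraction of translated tokens a user would need to edit (assumed metric)."""
    t, r = lex(translated), lex(reference)
    return token_edit_distance(t, r) / max(len(t), 1)
```

For example, comparing the translated output `string s = foo.length();` with the reference `string s = foo.Length;` requires three token edits (one substitution and two deletions) out of nine translated tokens. A purely lexical measure of this kind says nothing about whether the result parses, which is consistent with the gap between lexical accuracy and syntactic correctness reported above.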
