Phrase-Based Statistical Translation of Programming Languages

Phrase-based statistical machine translation approaches have been highly successful in translating between natural languages and are heavily used by commercial systems (e.g., Google Translate). The main objective of this work is to investigate the applicability of these approaches to translating between programming languages. To that end, we investigated three variants of the phrase-based translation approach: i) a direct application of the approach to programming languages, ii) a novel modification of the approach that incorporates the grammatical structure of the target programming language (so as to avoid generating target programs that do not parse), and iii) a combination of ii) with custom rules added to improve the quality of the translation. To experiment with these systems, we investigated machine translation from C# to Java. For training, which takes about 60 hours, we used a parallel corpus of 20,499 C#-to-Java method translations. We then evaluated each of the three systems by translating 1,000 C# methods. Our experimental results indicate that with the most advanced system, about 60% of the top-ranked translated methods compile, and that out of a random sample of 50 successfully compiled methods, 68% (34 methods) were semantically equivalent to the reference solution.
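To make the idea of variant i) concrete, the following is a minimal sketch of phrase-based translation applied to token sequences of code. It is not the system described in this work: the phrase table, its probabilities, and the names `PHRASE_TABLE`, `MAX_PHRASE_LEN`, and `translate` are hypothetical, and the decoder is a simple greedy, monotone longest-match procedure rather than a search over a log-linear model with a language model, as full phrase-based systems typically use.

```python
from typing import Dict, List, Tuple

Phrase = Tuple[str, ...]

# Hypothetical toy phrase table (an assumption for illustration only):
# C# token phrase -> candidate Java token phrases with made-up probabilities.
PHRASE_TABLE: Dict[Phrase, List[Tuple[Phrase, float]]] = {
    ("Console", ".", "WriteLine"): [(("System", ".", "out", ".", "println"), 0.9)],
    ("string",): [(("String",), 0.95)],
    ("bool",): [(("boolean",), 0.9)],
    (".", "Length"): [((".", "length", "(", ")"), 0.6), ((".", "length"), 0.4)],
}

MAX_PHRASE_LEN = 3  # longest source phrase considered by the decoder


def translate(tokens: List[str]) -> Tuple[List[str], float]:
    """Greedy monotone decoding over a tokenized source method.

    At each position, take the longest source phrase found in the table,
    emit its most probable translation, and multiply its probability into
    the score. Tokens with no matching phrase are copied unchanged.
    """
    output: List[str] = []
    score = 1.0
    i = 0
    while i < len(tokens):
        # Try longer phrases first, down to a single token.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            source = tuple(tokens[i:i + n])
            if source in PHRASE_TABLE:
                target, prob = max(PHRASE_TABLE[source], key=lambda c: c[1])
                output.extend(target)
                score *= prob
                i += n
                break
        else:
            # No phrase matched: pass the token through (identifiers, punctuation).
            output.append(tokens[i])
            i += 1
    return output, score


if __name__ == "__main__":
    csharp = ["Console", ".", "WriteLine", "(", "s", ".", "Length", ")", ";"]
    java, score = translate(csharp)
    print(" ".join(java), score)
```

Because a decoder like this chooses phrases without regard to the target grammar, nothing prevents it from emitting token sequences that do not parse, which is the problem that variant ii) is meant to address.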
