Unsupervised Translation of Programming Languages

A transcompiler, also known as a source-to-source translator, is a system that converts source code from one high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g., COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules applied to the abstract syntax tree of the source code. Unfortunately, the resulting translations often lack readability, fail to respect the target language's conventions, and require manual modifications to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in natural language translation, their application to transcompilation has been limited by the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open-source GitHub projects and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.
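To make the unit-test-based evaluation concrete, the sketch below shows how a candidate translation can be checked: a translation counts as correct when it produces the same outputs as the source function on a set of test inputs. The C++ function, its candidate Python translation, and the test cases here are hypothetical illustrations, not entries from the released test set.

```python
# Minimal sketch of unit-test-based evaluation of a translation: a
# candidate is accepted if it matches the source function's outputs on
# every test input. All names and data below are hypothetical examples.

CPP_SOURCE = """
int max_sub_array(int a[], int n) {
    int best = a[0], cur = a[0];
    for (int i = 1; i < n; i++) {
        cur = cur > 0 ? cur + a[i] : a[i];
        best = best > cur ? best : cur;
    }
    return best;
}
"""

# Candidate Python translation produced by a (hypothetical) model.
def max_sub_array(a, n):
    best = cur = a[0]
    for i in range(1, n):
        cur = cur + a[i] if cur > 0 else a[i]
        best = max(best, cur)
    return best

# Reference outputs would come from compiling and running CPP_SOURCE on
# the test inputs; they are hard-coded here to keep the sketch
# self-contained.
TEST_CASES = [
    (([-2, 1, -3, 4, -1, 2, 1, -5, 4], 9), 6),
    (([1, 2, 3], 3), 6),
    (([-1, -2, -3], 3), -1),
]

def translation_passes(fn, cases):
    """Return True if fn matches the reference output on every input."""
    return all(fn(*args) == expected for args, expected in cases)

assert translation_passes(max_sub_array, TEST_CASES)
```

This criterion is stricter than surface-level metrics such as BLEU: two programs can differ in wording yet still be counted as equivalent, as long as they compute the same results.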

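The training signal itself comes from monolingual code only. The Python sketch below outlines two standard unsupervised machine translation objectives that this setup builds on, denoising auto-encoding and back-translation. The Model class and the noise function are simplified stand-ins for an actual sequence-to-sequence architecture, so this is a schematic of where the training pairs come from, not the paper's implementation.

```python
# Schematic sketch of two unsupervised objectives used in unsupervised
# machine translation, here applied to monolingual source code. Model
# and add_noise are placeholders, not a real architecture; the point is
# that both objectives need only unpaired functions in each language.

import random

class Model:
    """Stand-in for a shared encoder-decoder over several languages."""

    def translate(self, tokens, target_lang):
        # A real model would decode tokens into target_lang; the stub
        # simply echoes its input.
        return tokens

    def train_step(self, source_tokens, target_tokens, target_lang):
        # A real implementation would minimize the cross-entropy of
        # target_tokens given source_tokens; the stub does nothing.
        pass

def add_noise(tokens, drop_prob=0.1):
    """Corrupt a token sequence by randomly dropping tokens."""
    kept = [t for t in tokens if random.random() > drop_prob]
    return kept or tokens

def training_epoch(model, monolingual, langs):
    for lang in langs:
        for tokens in monolingual[lang]:
            # Denoising auto-encoding: reconstruct a function from a
            # corrupted version of itself.
            model.train_step(add_noise(tokens), tokens, lang)
            # Back-translation: translate into another language with
            # the current model, then learn to map the result back to
            # the original function.
            other = random.choice([l for l in langs if l != lang])
            pseudo_source = model.translate(tokens, other)
            model.train_step(pseudo_source, tokens, lang)

# Toy usage: one "function" per language, tokenized on whitespace.
corpus = {
    "python": ["def add ( a , b ) : return a + b".split()],
    "cpp": ["int add ( int a , int b ) { return a + b ; }".split()],
}
training_epoch(Model(), corpus, list(corpus))
```

In this scheme, back-translation turns the model's own (initially noisy) translations into pseudo-parallel training pairs, which is what removes the need for any human-aligned source/target functions.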