TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer

The problem of fixing errors in programs has attracted substantial interest over the years. The key challenge for building an effective code fixing tool is to capture a wide range of errors and meanwhile maintain high accuracy. In this paper, we address this challenge and present a new learning-based system, called TFix. TFix works directly on program text and phrases the problem of code fixing as a text-to-text task. In turn, this enables it to leverage a powerful Transformer based model pre-trained on natural language and fine-tuned to generate code fixes (via a large, highquality dataset obtained from GitHub commits). TFix is not specific to a particular programming language or class of defects and, in fact, improved its precision by simultaneously fine-tuning on 52 different error types reported by a popular static analyzer. Our evaluation on a massive dataset of JavaScript programs shows that TFix is practically effective: it is able to synthesize code that fixes the error in ∼67 percent of cases and significantly outperforms existing learning-based approaches.

[1]  Yuriy Brun,et al.  Is the cure worse than the disease? overfitting in automated program repair , 2015, ESEC/SIGSOFT FSE.

[2]  Andrew Rice,et al.  Learning to Fix Build Errors with Graph2Diff Neural Networks , 2019, ICSE.

[3]  Xiaocheng Feng,et al.  CodeBERT: A Pre-Trained Model for Programming and Natural Languages , 2020, EMNLP.

[4]  Colin Raffel,et al.  Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..

[5]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[6]  Sarfraz Khurshid,et al.  Towards Practical Program Repair with On-demand Candidate Generation , 2018, 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).

[7]  Yitong Li,et al.  CoCoNuT: combining context-aware neural translation models using ensemble for program repair , 2020, ISSTA.

[8]  Omer Levy,et al.  Structural Language Models of Code , 2019, ICML.

[9]  Dawei Qi,et al.  SemFix: Program repair via semantic analysis , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[10]  Isil Dillig,et al.  LambdaNet: Probabilistic Type Inference using Graph Neural Networks , 2020, ICLR.

[11]  Andrea Janes,et al.  Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code , 2020, 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE).

[12]  Rico Sennrich,et al.  Neural Machine Translation of Rare Words with Subword Units , 2015, ACL.

[13]  Uri Alon,et al.  code2vec: learning distributed representations of code , 2018, Proc. ACM Program. Lang..

[14]  Martin Monperrus,et al.  NPEFix: Automatic Runtime Repair of Null Pointer Exceptions in Java , 2015, ArXiv.

[15]  Ronald J. Williams,et al.  A Learning Algorithm for Continually Running Fully Recurrent Neural Networks , 1989, Neural Computation.

[16]  Ke Wang,et al.  Dynamic Neural Program Embedding for Program Repair , 2017, ICLR.

[17]  Graham Neubig,et al.  Learning to Represent Edits , 2018, ICLR.

[18]  Loris D'Antoni,et al.  Learning Quick Fixes from Code Repositories , 2018, SBES.

[19]  Matias Martinez,et al.  A Comprehensive Study of Automatic Program Repair on the QuixBugs Benchmark , 2018, 2019 IEEE 1st International Workshop on Intelligent Bug Fixing (IBF).

[20]  Le Song,et al.  Hoppity: Learning Graph Transformations to Detect and Fix Bugs in Programs , 2020, ICLR.

[21]  Alexander M. Rush,et al.  OpenNMT: Open-Source Toolkit for Neural Machine Translation , 2017, ACL.

[22]  Hang Li,et al.  “ Tony ” DNN Embedding for “ Tony ” Selective Read for “ Tony ” ( a ) Attention-based Encoder-Decoder ( RNNSearch ) ( c ) State Update s 4 SourceVocabulary Softmax Prob , 2016 .

[23]  Johannes Bader,et al.  Getafix: learning to fix bugs automatically , 2019, Proc. ACM Program. Lang..

[24]  Yulei Sui,et al.  Flow2Vec: value-flow-based precise code embedding , 2020, Proc. ACM Program. Lang..

[25]  Rahul Gupta,et al.  Deep Reinforcement Learning for Programming Language Correction , 2018, ArXiv.

[26]  Rémi Louf,et al.  Transformers : State-ofthe-art Natural Language Processing , 2019 .

[27]  Jure Leskovec,et al.  Language-Agnostic Representation Learning of Source Code from Structure and Context , 2021, ICLR.

[28]  Marcelo de Almeida Maia,et al.  Dissection of a bug dataset: Anatomy of 395 patches from Defects4J , 2018, 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER).

[29]  Monperrus Martin Automatic Software Repair: a Bibliography , 2020 .

[30]  Rahul Gupta,et al.  DeepFix: Fixing Common C Language Errors by Deep Learning , 2017, AAAI.

[31]  Aditya Kanade,et al.  Neural Program Repair by Jointly Learning to Localize and Repair , 2019, ICLR.

[32]  Zheng Gao,et al.  Typilus: neural type hints , 2020, PLDI.

[33]  Denys Poshyvanyk,et al.  SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair , 2018, IEEE Transactions on Software Engineering.

[34]  Daniela Micucci,et al.  Automatic Software Repair: A Survey , 2018, 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).

[35]  Armando Solar-Lezama,et al.  sk_p: a neural program corrector for MOOCs , 2016, SPLASH.

[36]  Oleksandr Polozov,et al.  Generative Code Modeling with Graphs , 2018, ICLR.

[37]  Zack Coker,et al.  Program transformations to fix C integers , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[38]  Rishabh Singh,et al.  Global Relational Models of Source Code , 2020, ICLR.

[39]  Fan Long,et al.  An analysis of patch plausibility and correctness for generate-and-validate patch generation systems , 2015, ISSTA.

[40]  Abhik Roychoudhury,et al.  Angelix: Scalable Multiline Program Patch Synthesis via Symbolic Analysis , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[41]  Marc Brockschmidt,et al.  Learning to Represent Programs with Graphs , 2017, ICLR.

[42]  Fan Long,et al.  Automatic patch generation by learning correct code , 2016, POPL.

[43]  Koushik Sen,et al.  DeepBugs: a learning approach to name-based bug detection , 2018, Proc. ACM Program. Lang..

[44]  Aditya Kanade,et al.  Learning and Evaluating Contextual Embedding of Source Code , 2019, ICML.

[45]  Percy Liang,et al.  Graph-based, Self-Supervised Program Repair from Diagnostic Feedback , 2020, ICML.

[46]  Martin Monperrus,et al.  Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs , 2018, IEEE Transactions on Software Engineering.