Improving automatically generated code from Codex via Automated Program Repair

Large language models, e.g., Codex and AlphaCode, have shown capability in producing working code for many programming tasks. However, the success rate of existing models remains low, especially for complex programming tasks. One of the reasons is that language models lack awareness of program semantics (e.g., type information), resulting in incorrect programs (or even programs that do not compile). In this paper, we systematically study whether automated program repair (APR) techniques can fix the incorrect solutions produced by language models in LeetCode contests. The goal is to study whether APR techniques can enhance confidence in the code produced by language models. Our study revealed that: (1) automatically generated code shares some common programming mistakes with human-crafted solutions, indicating that existing APR tools have the potential to fix auto-generated code; (2) TBar and Recoder, two well-known Java APR tools based on templates and learning respectively, increase the number of solved tasks from 37 to 42 on 60 easy-level tasks, and from 5 to 9 on 53 medium-level programming tasks; (3) given bug location information provided by a statistical fault localization approach, the newly released Codex edit mode, which supports changing existing code, may outperform existing APR tools in fixing incorrect solutions. By analyzing the experimental results generated by these tools, we provide several suggestions: (1) as existing APR techniques are still quite limited in patch space, fix locations, and patch size, enhancing APR tools to overcome these limitations (e.g., by introducing a more flexible fault localization strategy) is desirable; (2) as large language models can derive more fix patterns by training on more data, future APR tools should shift focus from adding more patterns to encoding more program semantics.
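
To make "bug location information provided by a statistical fault localization approach" concrete, the following is a minimal sketch of spectrum-based fault localization. It assumes the Ochiai formula, which is one common choice and is not confirmed by the text above as the formula used in the study. The function names, the coverage data, and the test names are hypothetical and serve only as illustration.

```python
import math
from typing import Dict, List, Set


def ochiai_scores(coverage: Dict[str, Set[int]], failed: Set[str]) -> Dict[int, float]:
    """Score each covered line with the Ochiai suspiciousness formula.

    coverage maps a (hypothetical) test name to the set of line numbers it
    executed; failed is the set of test names that failed.
    Ochiai(line) = ef / sqrt(total_failed * (ef + ep)), where ef and ep count
    the failing and passing tests that executed the line.
    """
    total_failed = len(failed)
    lines: Set[int] = set().union(*coverage.values())
    scores: Dict[int, float] = {}
    for line in lines:
        ef = sum(1 for t, cov in coverage.items() if t in failed and line in cov)
        ep = sum(1 for t, cov in coverage.items() if t not in failed and line in cov)
        denom = math.sqrt(total_failed * (ef + ep))
        scores[line] = ef / denom if denom > 0 else 0.0
    return scores


def rank_lines(scores: Dict[int, float]) -> List[int]:
    """Return line numbers ordered from most to least suspicious."""
    return sorted(scores, key=lambda line: (-scores[line], line))


# Illustrative data only: three tests over a four-line solution, one failing.
COVERAGE = {"t1": {1, 2, 3}, "t2": {1, 2, 4}, "t3": {1, 3}}
FAILED = {"t2"}

if __name__ == "__main__":
    print(rank_lines(ochiai_scores(COVERAGE, FAILED)))
```

In this toy setting, line 4 is executed only by the failing test and is ranked first. A ranked list of this kind is the sort of location hint that could be handed to a repair tool or to an edit-style language model.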
