Can OpenAI Codex and Other Large Language Models Help Us Fix Security Bugs?

Human developers can produce code with cybersecurity weaknesses. Can emerging ‘smart’ code completion tools help repair those weaknesses? In this work, we examine the use of large language models (LLMs) for code (such as OpenAI’s Codex and AI21’s Jurassic J-1) for zero-shot vulnerability repair. We investigate the challenges of designing prompts that coax LLMs into generating repaired versions of insecure code. This is difficult due to the numerous ways of phrasing key information, both semantically and syntactically, in natural language. In a large-scale study of four commercially available, black-box, “off-the-shelf” LLMs, as well as a locally trained model, on a mix of synthetic, hand-crafted, and real-world security bug scenarios, we find that the LLMs could collectively repair 100% of our synthetically generated and hand-crafted scenarios, as well as 58% of the vulnerabilities in a selection of historical bugs from real-world open-source projects.
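To make the prompt-design challenge concrete, the sketch below shows one plausible shape for a zero-shot repair prompt: the flawed code is commented out and the model is cued to complete a fixed version. This is a minimal illustration, not the paper's actual prompt templates; complete() is a hypothetical stand-in for any completion-style LLM API (such as Codex or Jurassic J-1), and the CWE-787 snippet is an invented example.

    # Minimal sketch of a zero-shot vulnerability-repair prompt (Python).
    # `complete` is a hypothetical stand-in for a completion-style LLM API;
    # swap in a real client call as appropriate.

    VULNERABLE_SNIPPET = """\
    char buf[64];
    strcpy(buf, user_input);  /* CWE-787: out-of-bounds write, no length check */
    """

    def build_repair_prompt(code: str, bug_hint: str) -> str:
        """Comment out the flawed C code and cue the model to emit a fix."""
        commented = "\n".join("// " + line for line in code.splitlines())
        return (
            f"// BUG: {bug_hint}\n"
            "// The following code is vulnerable:\n"
            f"{commented}\n"
            "// FIXED version:\n"
        )

    def repair(code: str, bug_hint: str, complete) -> str:
        """`complete` is any callable taking (prompt, temperature, max_tokens)."""
        prompt = build_repair_prompt(code, bug_hint)
        # Low sampling temperature tends toward conservative, compilable patches;
        # higher temperatures yield more diverse candidate repairs.
        return complete(prompt, temperature=0.2, max_tokens=256)

Candidate completions sampled this way still need to be vetted before acceptance, for example by checking that the patched program compiles and passes functional and security tests, since the models offer no correctness guarantees.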
