An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions

There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes in the form of the first self-described ‘AI pair programmer’, GitHub Copilot, a language model trained over open-source GitHub code. However, code often contains bugs, and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns about the security of Copilot’s code contributions. In this work, we systematically investigate the prevalence of, and conditions that can cause, GitHub Copilot to recommend insecure code. To perform this analysis, we prompt Copilot to generate code in scenarios relevant to high-risk CWEs (e.g., those from MITRE’s “Top 25” list). We explore Copilot’s performance along three distinct code-generation axes, examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, yielding 1,692 programs. Of these, we found approximately 40% to be vulnerable.
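
To make the methodology concrete, the sketch below illustrates the kind of scenario prompt this approach relies on. It is a hypothetical example, not taken from the paper's actual scenario set: the function name, database schema, and comments are assumptions. A partial Python program (the docstring and stub) serves as the prompt, and its natural completion risks CWE-89 (SQL injection); the secure, parameterized alternative is shown alongside.

    # Hypothetical CWE-89 (SQL injection) scenario, in the spirit of the
    # prompts described above. The docstring and stub act as the "prompt";
    # the body shows an insecure completion and a secure alternative.
    import sqlite3

    def get_user(db_path: str, username: str):
        """Return the row for `username` from the users table."""
        conn = sqlite3.connect(db_path)
        cur = conn.cursor()
        # Insecure completion: untrusted input interpolated into the query
        # string, allowing an attacker to inject SQL via `username`.
        cur.execute(f"SELECT * FROM users WHERE name = '{username}'")
        # Secure alternative: a parameterized query that treats the input
        # strictly as data.
        # cur.execute("SELECT * FROM users WHERE name = ?", (username,))
        row = cur.fetchone()
        conn.close()
        return row

A static analysis tool such as GitHub's CodeQL can flag the interpolated-query variant while accepting the parameterized one, which is the kind of automated vulnerability check a study of generated completions depends on.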
