Repairing Security Vulnerabilities Using Pre-trained Programming Language Models

Repairing software bugs automatically is a long-standing goal of researchers. Some of the latest automated program repair (APR) tools leverage natural language processing (NLP) techniques to repair software bugs. However, natural languages (NL) and programming languages (PL) differ in important ways, so tools designed around NL may not handle PL tasks well. Moreover, because vulnerability repair differs from general bug repair, the performance of these tools on vulnerability repair remains unknown. To address these issues, we apply large-scale pre-trained PL models (CodeBERT and GraphCodeBERT) to the vulnerability repair task, exploiting the characteristics of PL, and evaluate the real-world performance of state-of-the-art data-driven approaches to vulnerability repair. The results show that pre-trained PL models capture and process PL features more effectively and can accomplish multi-line vulnerability repair. Specifically, our solution achieves advanced results (95.47% single-line repair accuracy and 90.06% multi-line repair accuracy), outperforming the state-of-the-art data-driven approaches and demonstrating that adding rich data-dependency features helps solve more complex code repair problems. Finally, we discuss prior work and our approach, pointing out shortcomings and directions for future work.
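As a rough illustration of how vulnerability repair can be framed as sequence-to-sequence fine-tuning of a pre-trained PL model, the sketch below warm-starts an encoder-decoder model from the public microsoft/codebert-base checkpoint using HuggingFace Transformers. This is a minimal sketch under stated assumptions, not the paper's exact implementation: the example code pair, sequence lengths, and beam size are illustrative, and GraphCodeBERT's data-flow inputs are omitted for brevity.

```python
# Minimal sketch: fine-tune a CodeBERT-based encoder-decoder to translate
# vulnerable code into its repaired form. Hyperparameters and the example
# pair below are illustrative assumptions, not the paper's setup.
import torch
from transformers import RobertaTokenizer, EncoderDecoderModel

# CodeBERT uses a RoBERTa-style tokenizer; warm-start both encoder and decoder
# from the same checkpoint (cross-attention weights are freshly initialized).
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/codebert-base", "microsoft/codebert-base"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# One hypothetical training pair: a vulnerable snippet and its fixed version.
vulnerable = "strcpy(buf, user_input);"
fixed = "strncpy(buf, user_input, sizeof(buf) - 1);"

inputs = tokenizer(vulnerable, return_tensors="pt", truncation=True, max_length=256)
labels = tokenizer(fixed, return_tensors="pt", truncation=True, max_length=256).input_ids

# A single fine-tuning step: the model learns to map vulnerable code to its fix.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()

# At inference time, candidate repairs are generated with beam search.
with torch.no_grad():
    repair_ids = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=64,
        num_beams=5,
    )
print(tokenizer.decode(repair_ids[0], skip_special_tokens=True))
```

In practice this step would run over a full dataset of vulnerable/fixed function pairs with an optimizer loop; multi-line repairs fall out of the same formulation, since the target sequence may span several lines.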
