Neural Transfer Learning for Repairing Security Vulnerabilities in C Code

This paper addresses the automatic repair of software vulnerabilities with deep learning. The main obstacle to data-driven vulnerability repair is that the few existing datasets of confirmed vulnerabilities contain only a few thousand examples, whereas training a deep learning model often requires hundreds of thousands. We build on the intuition that bug fixing and vulnerability fixing are related tasks, so knowledge learned from bug fixes can be transferred to fixing vulnerabilities; the machine learning community calls this technique transfer learning. We propose VRepair, an approach for repairing security vulnerabilities that is based on transfer learning. VRepair is first trained on a large bug-fix corpus and then tuned on a vulnerability-fix dataset that is an order of magnitude smaller. Our experiments show that a model trained only on a bug-fix corpus can already fix some vulnerabilities. We then demonstrate that transfer learning improves the ability to repair vulnerable C functions. Finally, we present evidence that transfer learning produces more stable and superior neural models for vulnerability repair.
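The two-phase schedule described above (train on a large source task, then tune on a small related target task) can be illustrated with a deliberately tiny sketch. This is not VRepair: a one-dimensional linear model stands in for the neural network, and the datasets, step counts, learning rate, and all names (`train`, `mse`, `source`, `target`) are hypothetical choices made only for illustration.

```python
# Toy illustration only: a 1-D linear model y = w*x + b stands in for the
# neural model. All data and names are hypothetical, not taken from VRepair.

def mse(params, data):
    """Mean squared error of the linear model on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(params, data, steps, lr=0.1):
    """Full-batch gradient descent on MSE; returns the updated (w, b)."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

# Large "source" task: stands in for the bug-fix corpus (201 examples).
source = [(i / 100, 2 * (i / 100) + 1.0) for i in range(-100, 101)]
# Small, related "target" task: stands in for the vulnerability-fix dataset.
target = [(-1.0, -0.8), (0.0, 1.2), (1.0, 3.2)]  # y = 2x + 1.2

init = (0.0, 0.0)
budget = 5  # only a few updates are affordable on the small target set

scratch = train(init, target, budget)             # target data only
pretrained = train(init, source, 200)             # source data only
transferred = train(pretrained, target, budget)   # pretrain, then tune

print("target only       :", mse(scratch, target))
print("source only       :", mse(pretrained, target))
print("source then target:", mse(transferred, target))
```

In this toy setting the source-only model already does reasonably well on the target task, and tuning it on the few target examples lowers the error further. That mirrors the shape of the claims in the abstract, but it says nothing about their magnitude on real code.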
