Bran: Reduce Vulnerability Search Space in Large Open Source Repositories by Learning Bug Symptoms

Software keeps growing in size and complexity, so vulnerability discovery benefits from techniques that identify potentially vulnerable regions within large code bases: narrowing the search space makes vulnerability detection easier. Previous work has explored using conventional code-quality and complexity metrics to highlight suspicious sections of source code. More recently, researchers have also proposed reducing the vulnerability search space by studying code properties with neural networks. However, previous work has generally failed to leverage the rich metadata available for long-running, large code repositories. In this paper, we present Bran, an approach that reduces the vulnerability search space by combining conventional code metrics with fine-grained repository metadata. Bran locates the code sections in large code bases that are more likely to contain vulnerabilities, potentially improving the efficiency of both manual and automated code audits. In our experiments on four large code bases, Bran successfully highlights potentially vulnerable functions and outperforms several baselines, including state-of-the-art vulnerability prediction tools. We also assess Bran's effectiveness in assisting automated testing tools by using it to guide syzkaller, a well-known kernel fuzzer, in fuzzing a recent version of the Linux kernel. The guided fuzzer identifies 26 bugs, 10 of which are zero-day flaws, including arbitrary writes and reads.
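As a rough illustration of what combining code metrics with repository metadata can look like, the sketch below ranks functions by an unweighted average of min-max normalized features. It is not Bran's model: the feature set (`cyclomatic_complexity`, `lines_of_code`, `commit_count`, `author_count`), the `FunctionRecord` type, and the equal-weight score are all assumptions made for this example, whereas the abstract describes Bran as a learning-based approach.

```python
from dataclasses import dataclass


@dataclass
class FunctionRecord:
    """Hypothetical per-function features; names are illustrative only."""
    name: str
    cyclomatic_complexity: int  # conventional code metric
    lines_of_code: int          # conventional code metric
    commit_count: int           # repository metadata: commits touching the function
    author_count: int           # repository metadata: distinct authors


def _normalize(values):
    """Min-max scale a list of numbers to [0, 1]; constant columns map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_functions(records, top_k):
    """Return the top_k (name, score) pairs, highest score first.

    The score is an equal-weight average of normalized features. This is an
    assumption for illustration, not the scoring used by Bran.
    """
    if not records:
        return []
    columns = [
        _normalize([r.cyclomatic_complexity for r in records]),
        _normalize([r.lines_of_code for r in records]),
        _normalize([r.commit_count for r in records]),
        _normalize([r.author_count for r in records]),
    ]
    scores = [
        sum(col[i] for col in columns) / len(columns)
        for i in range(len(records))
    ]
    ranked = sorted(zip(records, scores), key=lambda pair: pair[1], reverse=True)
    return [(record.name, round(score, 3)) for record, score in ranked[:top_k]]
```

An auditor or a fuzzer driver could then spend its budget on the first `top_k` entries instead of the whole code base, which is the search-space reduction the abstract refers to.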
