Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories

The lack of comprehensive sources of accurate vulnerability data is a critical obstacle to studying and understanding software vulnerabilities and their corrections. In this paper, we present an approach that combines heuristics stemming from practical experience with machine learning (ML), specifically natural language processing (NLP), to address this problem. Our method consists of three phases. First, an advisory record containing key information about a vulnerability is extracted from an advisory expressed in natural language. Second, using heuristics, a subset of candidate fix commits is obtained from the source code repository of the affected project by filtering out commits that are known to be irrelevant to the task at hand. Finally, for each candidate commit, our method builds a numerical feature vector reflecting the characteristics of the commit that are relevant to predicting whether it matches the advisory. The feature vectors are then used to build a final ranked list of candidate fix commits. The score attributed by the ML model to each feature is kept visible to users, allowing them to interpret the predictions. We evaluated our approach using a prototype implementation named Prospector on a manually curated data set that comprises 2,391 known fix commits corresponding to 1,248 public vulnerability advisories. When considering the top-10 commits in the ranked results, our implementation successfully identified at least one fix commit for up to 84.03% of the vulnerabilities (with a fix commit in the first position for 65.06% of the vulnerabilities). In conclusion, our method considerably reduces the effort needed to search OSS repositories for the commits that fix known vulnerabilities.
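The three phases described above can be illustrated with a minimal Python sketch. It is not Prospector's implementation: the record fields, the filtering heuristics (a time window around the advisory date and a skip rule for documentation-only commits), the three features, and the hand-set weights are all assumptions chosen for illustration, and a linear score stands in for the trained ML model.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


@dataclass
class AdvisoryRecord:
    """Phase 1 output: key facts extracted from an advisory (hypothetical fields)."""
    vuln_id: str
    description: str
    published: datetime


@dataclass
class Commit:
    sha: str
    message: str
    changed_files: List[str]
    timestamp: datetime


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def filter_candidates(advisory: AdvisoryRecord, commits: List[Commit],
                      window_days: int = 180) -> List[Commit]:
    """Phase 2: drop commits far from the advisory date or touching only docs.

    Both rules are assumed heuristics, not the ones used in the paper.
    """
    window = timedelta(days=window_days)
    return [
        c for c in commits
        if abs(c.timestamp - advisory.published) <= window
        and any(not f.endswith((".md", ".txt")) for f in c.changed_files)
    ]


def features(advisory: AdvisoryRecord, commit: Commit) -> Dict[str, float]:
    """Phase 3: numerical feature vector for one (advisory, commit) pair."""
    adv = tokens(advisory.description)
    msg = tokens(commit.message)
    paths = set()
    for f in commit.changed_files:
        paths |= tokens(f)
    return {
        "mentions_vuln_id": float(advisory.vuln_id.lower() in commit.message.lower()),
        "message_overlap": len(adv & msg) / len(adv) if adv else 0.0,
        "path_overlap": len(adv & paths) / len(adv) if adv else 0.0,
    }


# Assumed weights standing in for a trained model.
WEIGHTS = {"mentions_vuln_id": 3.0, "message_overlap": 2.0, "path_overlap": 1.0}


def rank(advisory: AdvisoryRecord,
         commits: List[Commit]) -> List[Tuple[str, float, Dict[str, float]]]:
    """Return (sha, total score, per-feature contributions), best first."""
    ranked = []
    for c in filter_candidates(advisory, commits):
        contributions = {k: WEIGHTS[k] * v for k, v in features(advisory, c).items()}
        ranked.append((c.sha, sum(contributions.values()), contributions))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

Returning the per-feature contributions alongside the total score mirrors the idea of keeping each feature's score visible, so a user can see why a commit was ranked where it was.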
