[Engineering Paper] Enabling the Continuous Analysis of Security Vulnerabilities with VulData7

Studies on security vulnerabilities require analyzing, investigating, and understanding real instances of vulnerable code. However, collecting and experimenting with a sufficient number of such instances is challenging. To address this issue, we developed VulData7, an extensible framework and dataset of real vulnerabilities, automatically collected from software archives. The current version of the dataset contains all vulnerabilities reported in the NVD database for four security-critical open-source systems: the Linux Kernel, Wireshark, OpenSSL, and systemd. For each vulnerability, VulData7 provides the vulnerability report data (description, CVE number, CWE number, CVSS severity score, and others), the vulnerable code instance (list of versions), and, when available, its corresponding patches (list of fixing commits) and the files (before and after the fix). VulData7 is automated, flexible, and easily extensible. Once configured, it extracts and links information from the related software archives (through Git and NVD reports) to create a dataset that is continuously updated with the latest information available. Currently, VulData7 retrieves fixes for 1,600 of the 2,800 reported vulnerabilities of the four systems. The framework also supports the collection of additional software defects and aims to ease empirical studies and analyses. We believe that our framework is a valuable resource for both developers and researchers interested in secure software development. VulData7 can also serve educational purposes and trigger research on source code analysis. VulData7 is publicly available at: https://github.com/electricalwind/data7
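
To make the per-vulnerability record and the linking step more concrete, the following is a minimal Python sketch, not the framework's actual implementation. The abstract states only that information is linked through Git and NVD reports; the record layout (`VulnerabilityRecord`), the helper `commits_from_references`, and the heuristic of reading commit hashes out of report reference URLs are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed heuristic: a fixing commit appears in a reference URL either as
# ".../commit/<hash>" or as a query parameter "h=<hash>" / "id=<hash>".
COMMIT_IN_URL = re.compile(r"(?:commit/|[?;&]h=|[?;&]id=)([0-9a-f]{7,40})\b")


@dataclass
class VulnerabilityRecord:
    """Hypothetical record mirroring the fields listed in the abstract."""
    cve_id: str
    description: str = ""
    cwe: str = ""
    cvss: Optional[float] = None
    affected_versions: List[str] = field(default_factory=list)
    fixing_commits: List[str] = field(default_factory=list)


def commits_from_references(urls: List[str]) -> List[str]:
    """Return the distinct commit hashes found in the given URLs, in order."""
    found: List[str] = []
    for url in urls:
        for commit_hash in COMMIT_IN_URL.findall(url):
            if commit_hash not in found:
                found.append(commit_hash)
    return found


def link_report(cve_id: str, reference_urls: List[str]) -> VulnerabilityRecord:
    """Build a record whose fixing commits come from the report's references."""
    return VulnerabilityRecord(
        cve_id=cve_id,
        fixing_commits=commits_from_references(reference_urls),
    )
```

Under these assumptions, a report with no commit-bearing reference yields an empty `fixing_commits` list. This is consistent with the abstract's observation that fixes are available for only part of the reported vulnerabilities.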
