Vulnerability identification and classification via text mining bug databases

As critical and sensitive systems come to depend on increasingly complex software, identifying software vulnerabilities becomes more important. Previous work suggests that some bugs are identified as vulnerabilities only long after the bug has been made public. These bugs are known as Hidden Impact Bugs (HIBs). This paper presents a methodology for identifying hidden impact bugs by text mining bug databases. The methodology relies on the textual description of each bug report. The text mining process extracts syntactical information from the bug reports and compresses it for easier manipulation. The compressed information is then used to generate a feature vector that is presented to a classifier. The methodology was tested on Linux vulnerabilities discovered between 2006 and 2011. Three different classifiers were tested, and 28% to 88% of the hidden impact bugs were identified correctly using the textual information from the bug descriptions alone. Further analysis of the Bayesian detection rate showed the applicability of the presented method according to the requirements of a development team.
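The Bayesian detection rate mentioned above is the probability that a bug flagged by the classifier really is a hidden impact bug. The following is a minimal sketch of that calculation using Bayes' theorem, not the paper's own code. The function name is hypothetical, and the false positive rate and the base rates of HIBs in the example are illustrative assumptions; only the 88% identification rate comes from the abstract.

```python
def bayesian_detection_rate(tpr, fpr, base_rate):
    """Return P(bug is a HIB | classifier flags it) via Bayes' theorem.

    tpr       -- P(flagged | HIB), the true positive rate
    fpr       -- P(flagged | not HIB), the false positive rate
    base_rate -- P(HIB), the fraction of all bugs that are HIBs
    """
    for p in (tpr, fpr, base_rate):
        if not 0.0 <= p <= 1.0:
            raise ValueError("all inputs must be probabilities in [0, 1]")
    # Total probability that any given bug is flagged.
    flagged = tpr * base_rate + fpr * (1.0 - base_rate)
    if flagged == 0.0:
        return 0.0  # nothing is ever flagged, so no flagged bug is a HIB
    return tpr * base_rate / flagged


if __name__ == "__main__":
    # 0.88 is the best identification rate reported in the abstract.
    # The false positive rate and base rates below are assumed values.
    for base_rate in (0.01, 0.05, 0.20):
        rate = bayesian_detection_rate(tpr=0.88, fpr=0.10, base_rate=base_rate)
        print(f"base rate {base_rate:.2f} -> detection rate {rate:.3f}")
```

Under these assumed numbers, a low base rate pulls the detection rate well below the true positive rate, which is why the acceptable trade-off depends on how many false alarms a development team is willing to review.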
