A Machine Learning Approach for Vulnerability Curation

Software composition analysis depends on a database of open-source library vulnerabilities, curated by security researchers using various sources, such as bug tracking systems, commits, and mailing lists. We report the design and implementation of a machine learning system that helps the curation by automatically predicting the vulnerability-relatedness of each data item. It supports a complete pipeline from data collection, model training, and prediction to the validation of new models before deployment. It is executed iteratively to generate better models as new input data become available. We use self-training to significantly and automatically increase the size of the training dataset, opportunistically maximizing the improvement in the models' quality at each iteration. We devised a new deployment stability metric to evaluate the quality of new models before deployment into production, which helped to discover an error. We experimentally evaluate the improvement in the performance of the models in one iteration, with a maximum PR AUC improvement of 27.59%. Ours is the first such study across a variety of data sources. We discover that adding the features of the corresponding commits to the features of issues/pull requests improves the precision for the recall values that matter. We demonstrate the effectiveness of self-training alone, with a 10.50% PR AUC improvement, and we discover that there is no uniform ordering of word2vec parameter sensitivity across data sources.
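
The self-training step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the classifier, features, or confidence threshold the system uses, so everything below is an assumption made for illustration: a toy nearest-centroid model stands in for the real classifier, and the names `fit_centroids`, `prob_positive`, `self_train`, `threshold`, and `max_rounds` are hypothetical. The sketch shows only the general idea: unlabeled items that the current model predicts with high confidence are added to the training set with their predicted labels, and the model is retrained.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def fit_centroids(X: List[Vector], y: List[int]) -> Dict[int, List[float]]:
    """Toy stand-in model (assumption): mean feature vector per class 0 and 1."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def prob_positive(centroids: Dict[int, List[float]], x: Vector) -> float:
    """Pseudo-probability of class 1 from the relative distance to both centroids."""
    d0 = math.dist(x, centroids[0])
    d1 = math.dist(x, centroids[1])
    if d0 + d1 == 0:
        return 0.5
    return d0 / (d0 + d1)


def self_train(
    X_lab: List[Vector],
    y_lab: List[int],
    X_unlab: List[Vector],
    threshold: float = 0.9,  # hypothetical confidence cut-off
    max_rounds: int = 5,     # hypothetical iteration cap
) -> Tuple[Dict[int, List[float]], List[Vector], List[int], List[Vector]]:
    """Grow the labeled set with confidently pseudo-labeled items, then retrain."""
    X, y = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    model = fit_centroids(X, y)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for x in pool:
            p = prob_positive(model, x)
            if p >= threshold:
                confident.append((x, 1))
            elif p <= 1 - threshold:
                confident.append((x, 0))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new to learn from; stop early
        for x, label in confident:
            X.append(x)
            y.append(label)
        pool = remaining
        model = fit_centroids(X, y)  # retrain on the enlarged set
    return model, X, y, pool


# Usage with made-up 2-D feature vectors (1 = vulnerability-related).
labeled = [(0.0, 0.0), (1.0, 1.0)]
labels = [0, 1]
unlabeled = [(0.05, 0.0), (0.95, 1.0), (0.5, 0.5)]
model, X_all, y_all, leftover = self_train(labeled, labels, unlabeled)
print(len(X_all), y_all, leftover)
```

In this sketch the ambiguous item stays in the unlabeled pool, which reflects the usual design choice in self-training: only confident predictions are trusted as labels, so that labeling errors are less likely to be fed back into the model.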
