FastEmbed: Predicting vulnerability exploitation possibility based on ensemble machine learning algorithm

In recent years, the number of vulnerabilities discovered and publicly disclosed has risen sharply. However, only a small fraction of vulnerabilities are ever exploited, so their value to attackers varies widely. Organizations with limited resources therefore need to rule out non-exploitable vulnerabilities quickly and prioritize patches accordingly. Recent work uses machine learning to predict exploited vulnerabilities from features extracted from open-source intelligence (OSINT). However, as the volume of vulnerability information grows explosively, these methods leave room for improvement when applied across multiple threat intelligence sources, and a more general method is needed. Moreover, previous methods processed vulnerability descriptions with traditional text-processing techniques, which capture only static statistical characteristics and ignore the context and meaning of the words. To address these challenges, we propose fastEmbed, an exploit prediction model that combines fastText and LightGBM. We replicate key portions of state-of-the-art exploit prediction work and use them as baseline models. By learning embeddings of vulnerability-related text on extremely imbalanced data sets, our model outperforms the baseline in both generalization ability and prediction ability without temporal intermixing, with an average overall improvement of 6.283%. In predicting exploits in the wild, our model also outperforms the baseline, achieving an F1 measure of 0.586 on the minority class (a 33.577% improvement over the work using features from darkweb/deepweb). The results demonstrate that the model better characterizes the exploitability of vulnerabilities and effectively predicts exploits in the wild.
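The abstract reports results as an F1 measure on the minority (exploited) class and as a relative improvement over a baseline. The following is a minimal sketch of how those two quantities are commonly computed, and of why accuracy alone is misleading on imbalanced data. The function names and the toy labels are assumptions made for illustration. They are not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
def minority_f1(y_true, y_pred, positive=1):
    """F1 score for the class labelled `positive` (here: exploited)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0


# Hypothetical toy data: 100 vulnerabilities, 5 of them exploited (label 1).
y_true = [1] * 5 + [0] * 95

# A trivial classifier that predicts "not exploited" for everything.
y_all_negative = [0] * 100

# A hypothetical model: finds 3 of the 5 exploited, raises 2 false alarms.
y_model = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93

accuracy_all_negative = sum(
    t == p for t, p in zip(y_true, y_all_negative)
) / len(y_true)
f1_all_negative = minority_f1(y_true, y_all_negative)
f1_model = minority_f1(y_true, y_model)
```

On this toy data the trivial classifier reaches 95% accuracy while its minority-class F1 is zero, which is why the minority-class F1 is the more informative metric when exploited vulnerabilities are rare.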
