DarkEmbed: Exploit Prediction With Neural Language Models

Software vulnerabilities can expose computer systems to attacks by malicious actors. With the number of vulnerabilities discovered in the recent years surging, creating timely patches for every vulnerability is not always feasible. At the same time, not every vulnerability will be exploited by attackers; hence, prioritizing vulnerabilities by assessing the likelihood they will be exploited has become an important research problem. Recent works used machine learning techniques to predict exploited vulnerabilities by analyzing discussions about vulnerabilities on social media. These methods relied on traditional text processing techniques, which represent statistical features of words, but fail to capture their context. To address this challenge, we propose DarkEmbed, a neural language modeling approach that learns low dimensional distributed representations, i.e., embeddings, of darkweb/deepweb discussions to predict whether vulnerabilities will be exploited. By capturing linguistic regularities of human language, such as syntactic, semantic similarity and logic analogy, the learned embeddings are better able to classify discussions about exploited vulnerabilities than traditional text analysis methods. Evaluations demonstrate the efficacy of learned embeddings on both structured text (such as security blog posts) and unstructured text (darkweb/deepweb posts). DarkEmbed outperforms state-of-the-art approaches on the exploit prediction task with an F1-score of 0.74.

[1]  Karen A. Scarfone,et al.  SP 800-117. Guide to Adopting and Using the Security Content Automation Protocol (SCAP) Version 1.0 , 2010 .

[2]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[3]  Nick Feamster,et al.  PREDATOR: Proactive Recognition and Elimination of Domain Abuse at Time-Of-Registration , 2016, CCS.

[4]  Alan Said,et al.  Predicting Cyber Vulnerability Exploits with Machine Learning , 2015, Scandinavian Conference on AI.

[5]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.

[6]  Ahmad Diab,et al.  Product offerings in malicious hacker markets , 2016, 2016 IEEE Conference on Intelligence and Security Informatics (ISI).

[7]  Geoffrey E. Hinton,et al.  Three new graphical models for statistical language modelling , 2007, ICML '07.

[8]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[9]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[10]  Hsinchun Chen,et al.  AZSecure Hacker Assets Portal: Cyber threat intelligence and malware analysis , 2016, 2016 IEEE Conference on Intelligence and Security Informatics (ISI).

[11]  Paulo Shakarian,et al.  Exploring Malicious Hacker Forums , 2016, Cyber Deception.

[12]  Karen A. Scarfone,et al.  An analysis of CVSS version 2 vulnerability scoring , 2009, ESEM 2009.

[13]  Parinaz Naghizadeh Ardabili,et al.  Cloudy with a Chance of Breach: Forecasting Cyber Security Incidents , 2015, USENIX Security Symposium.

[14]  Mehran Bozorgi,et al.  Beyond heuristics: learning to classify vulnerabilities and predict exploits , 2010, KDD.

[15]  Nicolas Christin,et al.  Automatically Detecting Vulnerable Websites Before They Turn Malicious , 2014, USENIX Security Symposium.

[16]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[17]  Ahmad Diab,et al.  Darkweb Cyber Threat Intelligence Mining , 2017 .

[18]  Timothy W. Finin,et al.  CyberTwitter: Using Twitter to generate alerts for cybersecurity threats and vulnerabilities , 2016, 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM).

[19]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[20]  Ahmad Diab,et al.  Darknet and deepnet mining for proactive cybersecurity threat intelligence , 2016, 2016 IEEE Conference on Intelligence and Security Informatics (ISI).

[21]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[22]  James Kinross,et al.  Effective cybersecurity is fundamental to patient safety , 2017, British Medical Journal.

[23]  J.A. Anderson,et al.  Neurocomputing: Foundations of Research@@@Neurocomputing 2: Directions for Research , 1992 .

[24]  Tudor Dumitras,et al.  Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits , 2015, USENIX Security Symposium.