Finding Malicious Cyber Discussions in Social Media

Abstract: Security analysts gather essential information on cyber attacks, exploits, vulnerabilities, and victims by manually searching social media sites. This effort can be dramatically reduced using natural language machine learning techniques. Using a new English text corpus containing more than 250k discussions from Stack Exchange, Reddit, and Twitter on cyber and non-cyber topics, we demonstrate the ability to detect more than 90% of the cyber discussions with fewer than 1% false alarms. If an original searched document corpus includes only 5% cyber documents, then our processing provides an enriched corpus for analysts where 83% to 95% of the documents are on cyber topics. Good performance was obtained using TF-IDF features and logistic regression. A classifier trained using prior historical data accurately detected 86% of emergent Heartbleed discussions, and retrospective experiments demonstrate that classifier performance remains stable up to a year without retraining.
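The enrichment figures follow from the base rate, the detection rate, and the false-alarm rate. The sketch below works through that arithmetic. The function name and the 0.25% false-alarm rate are assumptions made for illustration; the abstract states only the 5% prior, the detection rate above 90%, the false-alarm rate below 1%, and the resulting 83% to 95% range.

```python
def enriched_cyber_fraction(prior, detection_rate, false_alarm_rate):
    """Fraction of classifier-flagged documents that are truly cyber.

    prior            -- share of cyber documents in the searched corpus
    detection_rate   -- share of cyber documents the classifier flags
    false_alarm_rate -- share of non-cyber documents the classifier flags
    """
    true_hits = prior * detection_rate
    false_hits = (1.0 - prior) * false_alarm_rate
    return true_hits / (true_hits + false_hits)


# Operating point at the abstract's stated bounds:
# 5% cyber prior, 90% detection, 1% false alarms.
low = enriched_cyber_fraction(0.05, 0.90, 0.01)

# Hypothetical stricter operating point (0.25% false alarms), chosen
# only to show how the upper end of the range could arise.
high = enriched_cyber_fraction(0.05, 0.90, 0.0025)

print(f"enriched cyber fraction: {low:.1%} to {high:.1%}")
```

With a 5% prior, 90% detection and 1% false alarms give an enriched corpus that is roughly 83% cyber. Lowering the false-alarm rate to the assumed 0.25% raises that to roughly 95%. The final fraction is so sensitive because non-cyber documents outnumber cyber ones 19 to 1, so each false alarm counts heavily.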
