Data-Driven Cybersecurity Incident Prediction: A Survey

Driven by the increasing scale and high profile of cybersecurity incidents, and by the public data related to them, recent years have witnessed a paradigm shift in understanding and defending against evolving cyber threats, from primarily reactive detection toward proactive prediction. Meanwhile, governments, businesses, and individual Internet users show a growing appetite for improving cyber resilience, that is, their ability to prepare for, combat, and recover from cyber threats and incidents. Predicting cybersecurity incidents is widely regarded as having strong potential for proactively advancing cyber resilience. Research communities and industry have begun proposing cybersecurity incident prediction schemes that draw on different types of data sources, including organizations' reports and datasets, network data, synthetic data, data crawled from webpages, and data retrieved from social media. With a focus on the dataset, this survey investigates the emerging research by reviewing recent representative works that appeared in the dominant period. We also extract and summarize the data-driven research methodology commonly adopted in this fast-growing area. Following the phases of this methodology, each work that predicts cybersecurity incidents is studied comprehensively. Challenges and future directions in this field are also discussed.
