Labeling Hacker Exploits for Proactive Cyber Threat Intelligence: A Deep Transfer Learning Approach

With the rapid development of new technologies, the number of vulnerabilities is at an all-time high. Companies are investing in Cyber Threat Intelligence (CTI) to counteract these vulnerabilities. However, this CTI is generally reactive because it is based on internal data. Hacker forums can provide proactive CTI value through automated analysis of new trends and exploits. One way to identify exploits is to analyze the source code posted on these forums. These source code snippets are often noisy and unlabeled, which makes standard data labeling techniques ineffective. This study designs a novel framework for the automated collection and categorization of hacker forum exploit source code. We propose a deep transfer learning framework, Deep Transfer Learning for Exploit Labeling (DTL-EL). DTL-EL leverages the representation learned from professionally labeled exploits to generalize better to hacker forum exploits. The model classifies the collected hacker forum exploits into eight predefined categories for proactive and timely CTI. The results indicate that DTL-EL outperforms other prominent models in the hacker forum literature.
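The transfer step described above, learning from labeled professional exploits and then adapting to noisy forum snippets, can be illustrated with a minimal sketch. This is not the DTL-EL architecture: a linear softmax classifier over bag-of-token counts stands in for the deep model, the two labels (`sql_injection`, `xss`) are placeholders rather than the paper's eight categories, and all snippets are invented toy data. The class name `BagOfTokensClassifier` and the helper `tokenize` are hypothetical.

```python
import math
import re
from collections import defaultdict


def tokenize(code):
    """Split a snippet into lowercase identifier-like tokens (assumed scheme)."""
    return re.findall(r"[a-z_]+", code.lower())


def softmax(scores):
    """Numerically stable softmax over a {label: score} dict."""
    top = max(scores.values())
    exps = {lab: math.exp(s - top) for lab, s in scores.items()}
    total = sum(exps.values())
    return {lab: e / total for lab, e in exps.items()}


class BagOfTokensClassifier:
    """Linear softmax classifier; a stand-in for a deep model."""

    def __init__(self, labels, weights=None):
        self.labels = list(labels)
        # Copy any supplied weights so fine-tuning never mutates the source model.
        self.weights = {
            lab: defaultdict(float, (weights or {}).get(lab, {}))
            for lab in self.labels
        }

    def scores(self, tokens):
        return {lab: sum(self.weights[lab][t] for t in tokens) for lab in self.labels}

    def predict(self, code):
        s = self.scores(tokenize(code))
        return max(self.labels, key=lambda lab: s[lab])

    def fit(self, samples, epochs=50, lr=0.5):
        """Stochastic gradient descent on the cross-entropy loss."""
        for _ in range(epochs):
            for code, gold in samples:
                tokens = tokenize(code)
                probs = softmax(self.scores(tokens))
                for lab in self.labels:
                    grad = probs[lab] - (1.0 if lab == gold else 0.0)
                    for t in tokens:
                        self.weights[lab][t] -= lr * grad
        return self


LABELS = ["sql_injection", "xss"]  # placeholder categories, not the paper's

# Source domain: clean, labeled "professional" exploits (toy data).
source = [
    ("SELECT * FROM users WHERE id = '1' OR '1'='1'", "sql_injection"),
    ("UNION SELECT password FROM admin", "sql_injection"),
    ("<script>alert(document.cookie)</script>", "xss"),
    ("<img src=x onerror=alert(1)>", "xss"),
]

# Target domain: a few noisy forum snippets with labels (toy data).
target = [
    ("lol try this: union select pass from members --", "sql_injection"),
    ("paste into comment box: <script>steal(cookie)</script>", "xss"),
]

# Step 1: learn a representation on the source domain.
source_model = BagOfTokensClassifier(LABELS).fit(source)

# Step 2: initialize from the source weights, then fine-tune briefly on the target.
transferred = BagOfTokensClassifier(LABELS, weights=source_model.weights)
transferred.fit(target, epochs=10, lr=0.1)

print(transferred.predict("select login from accounts where id = 1"))
print(transferred.predict("<script>document.location='x'</script>"))
```

Starting the target model from the source weights means tokens seen only in the professional exploits (for example `where` or `document`) still carry signal on forum snippets, while the short fine-tuning pass adapts the model to forum-specific vocabulary.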
