NEDetector: Automatically extracting cybersecurity neologisms from hacker forums

Abstract Underground hacker forums serve as online social platforms where hackers communicate and spread hacking techniques and tools. Much of the latest information in these forums directly or indirectly affects cyberspace security and thereby threatens the assets of enterprises and individuals. Social media such as hacker forums and Twitter therefore have a great impact on cybersecurity. In recent years, analyzing hacker forum data to explore hacking activities and support cybersecurity situational awareness has aroused widespread interest among researchers. However, automatically identifying cybersecurity words and extracting neologisms from open-source social platforms has been less successful and still requires further research. To provide early warning of cyber attack incidents, we propose NEDetector, a novel method that automatically identifies cybersecurity words and extracts neologisms from unstructured content, focusing mainly on attack groups and hacking tools. First, NEDetector analyzes cybersecurity words and proposes four groups of features to build a cybersecurity word identification model based on a bidirectional LSTM. Second, NEDetector introduces four sets of features to identify cybersecurity neologisms with a random forest classifier. The experimental results show that the complete NEDetector system achieves an identification precision of 89.11%. Furthermore, when making predictions on Twitter data, the proposed neologism extraction system often detects a term before Google Trends has enough data on it, which demonstrates the validity and timeliness of the presented system.
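
For readers unfamiliar with the metric, the precision reported above is the share of items flagged by a system that are actually correct. The following is a minimal sketch of that calculation, assuming binary labels (1 = flagged or true neologism, 0 = not). The function name and the toy labels are illustrative assumptions and do not come from the paper or its data.

```python
def precision(y_true, y_pred):
    """Return TP / (TP + FP): the fraction of predicted positives that are correct."""
    # True positives: predicted 1 and actually 1.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    # False positives: predicted 1 but actually 0.
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        # No positive predictions at all; report 0.0 by convention here.
        return 0.0
    return tp / (tp + fp)


# Toy labels for illustration only (not results from the paper).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]
score = precision(y_true, y_pred)
print(f"precision = {score:.2f}")
```

Precision ignores missed positives (false negatives), so a high value says that flagged terms are usually correct, not that every neologism was found.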
