iDetector: Automate Underground Forum Analysis Based on Heterogeneous Information Network

Online underground forums are widely used by cybercriminals to trade illicit products, resources, and services, and they play a central role in the cybercriminal ecosystem. Unfortunately, given the number of forums, their size, and the expertise required, manually exploring them to understand their behavioral processes is infeasible. In this paper, we propose a novel framework named iDetector that automates the analysis of underground forums to detect cybercrime-suspected threads. To decide whether a given thread is cybercrime-suspected, iDetector analyzes not only the content of the thread but also the relations among threads, users, replies, and topics. To model these rich semantic relationships (i.e., thread-user, thread-reply, thread-topic, reply-user, and reply-topic relations), we introduce a structured heterogeneous information network (HIN), a representation that can comprise different types of entities and relations. To capture complex relationships (e.g., two threads are relevant if they were posted by the same user and discuss the same topic), we use a meta-structure based approach to characterize the semantic relatedness over threads. Because different meta-structures depict the relatedness over threads from different views, we then build a classifier that uses Laplacian scores to aggregate the similarities formulated by the different meta-structures and make predictions. To the best of our knowledge, this is the first work to use a structured HIN to automate underground forum analysis. Comprehensive experiments on real data collections from underground forums (e.g., Hack Forums) validate the effectiveness of iDetector in detecting cybercrime-suspected threads in comparison with alternative methods.
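
The meta-structure example in the abstract (two threads are related if they were posted by the same user and discuss the same topic) can be illustrated with a minimal sketch. It is not the paper's implementation: the toy matrices, the function names, and the use of an element-wise (Hadamard) product to combine the two meta-path counts are assumptions made for illustration only.

```python
import numpy as np

# Toy incidence matrices (hypothetical data); rows are threads.
# thread_user[i, j] = 1 if thread i was posted by user j.
thread_user = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [1, 0],
])
# thread_topic[i, k] = 1 if thread i discusses topic k.
thread_topic = np.array([
    [1, 0],
    [1, 0],
    [1, 0],
    [0, 1],
])


def commuting_matrix(incidence):
    """Count meta-path instances thread -> entity -> thread."""
    return incidence @ incidence.T


def meta_structure_relatedness(thread_user, thread_topic):
    """Relate threads that share BOTH a user and a topic.

    Assumption: the two meta-path counts are combined with an
    element-wise product, so an entry is non-zero only when both
    the shared-user and the shared-topic conditions hold.
    """
    same_user = commuting_matrix(thread_user)
    same_topic = commuting_matrix(thread_topic)
    return same_user * same_topic


relatedness = meta_structure_relatedness(thread_user, thread_topic)
print(relatedness)
```

In this toy data, threads 0 and 1 share both a user and a topic, so they are related. Thread 2 shares only the topic with them and thread 3 shares only the user, so neither is related to threads 0 and 1 under this meta-structure.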
