Analyzing and Identifying Data Breaches in Underground Forums

In recent years, underground forums have played a crucial role in the trading and exchange of leaked personal information. At the same time, these forums have gradually become a source of information about data breaches, and posts announcing the results of data theft show an upward trend. Identifying these threads allows the compromised third party to respond quickly to a data breach incident. To this end, we present a system that automatically identifies threads related to data breaches. The system can monitor underground forums and discover data breaches in real time. In addition, the study reveals the wording characteristics of these threads by applying a feature extraction method based on the LDA topic model. The data set was collected from both the surface web and the dark web. To improve the performance of the system, we compared various supervised classification algorithms in this application scenario and selected the best-performing method for the classifier. Using the system, we identified more than 92% of the data breach threads in the experimental data set.
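
The comparison of classifiers described above can be illustrated with a minimal sketch. It assumes that "identified more than 92% of data breach threads" refers to recall on the positive class and that candidates are ranked by recall with precision as a tie-breaker; the abstract does not state the selection criterion. The classifier names, labels, and predictions are hypothetical toy values, not results from the paper.

```python
from typing import Dict, List, Tuple


def precision_recall(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Return (precision, recall) for the positive class (1 = data breach thread)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # breach threads found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # breach threads missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def select_best(y_true: List[int], predictions: Dict[str, List[int]]) -> str:
    """Pick the classifier with the highest recall; ties are broken by precision.

    Ranking by recall is an assumption made for this sketch.
    """
    def rank(name: str) -> Tuple[float, float]:
        precision, recall = precision_recall(y_true, predictions[name])
        return (recall, precision)

    return max(predictions, key=rank)


# Hypothetical toy data: 1 = thread about a data breach, 0 = any other thread.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = {
    "classifier_a": [1, 1, 1, 0, 0, 0, 0, 1],
    "classifier_b": [1, 1, 1, 1, 0, 0, 1, 1],
    "classifier_c": [1, 1, 0, 0, 0, 0, 0, 0],
}

best = select_best(y_true, predictions)
for name, y_pred in predictions.items():
    p, r = precision_recall(y_true, y_pred)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print("selected:", best)
```

Ranking by recall favors a monitoring setting in which a missed breach thread costs more than a false alarm; a different trade-off, such as F1, would only require changing the `rank` key.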
