ATOL: A Framework for Automated Analysis and Categorization of the Darkweb Ecosystem

We present a framework for automated analysis and categorization of .onion websites in the darkweb to facilitate analyst situational awareness of new content that emerges from this dynamic landscape. Over the last two years, our team has developed a large-scale darkweb crawling infrastructure called OnionCrawler that acquires new onion domains on a daily basis, and crawls and indexes millions of pages from these new and previously known .onion sites. It stores this data in a research repository designed to help better understand Tor’s hidden service ecosystem. The analysis component of our framework is called Automated Tool for Onion Labeling (ATOL), which introduces a two-stage thematic labeling strategy: (1) it learns descriptive and discriminative keywords for different categories, and (2) it uses these terms to map onion site content to a set of thematic labels. We also present empirical results of ATOL and our ongoing experimentation with it, as we have gained experience applying it to the entirety of our darkweb repository, now over 70 million indexed pages. We find that ATOL can perform site-level thematic label assignment more accurately than keyword-based schemes developed by domain experts: we expand the analyst-provided keywords using an automatic keyword discovery algorithm, and obtain a 12% gain in accuracy by using a machine learning classification model. We also show how ATOL can discover categories on previously unlabeled onions and discuss applications of ATOL in supporting various analyses and investigations of the darkweb.
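The two-stage strategy described above can be illustrated with a minimal sketch. This is not ATOL's actual algorithm, which the abstract does not specify. It assumes a smoothed log-odds score over document frequencies for the keyword-learning stage and a simple keyword-overlap vote for the labeling stage. The function names (`tokenize`, `learn_keywords`, `label_site`), the toy categories, and the sample texts are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha()]


def learn_keywords(labeled_docs, top_k=3):
    """Stage 1 (assumed scoring): pick the top_k terms per label whose
    smoothed document frequency inside the label most exceeds the
    frequency outside it (log-odds ratio with add-one smoothing)."""
    doc_freq = defaultdict(Counter)  # label -> term -> docs containing term
    n_docs = Counter()               # label -> number of documents
    for label, text in labeled_docs:
        n_docs[label] += 1
        doc_freq[label].update(set(tokenize(text)))

    total_docs = sum(n_docs.values())
    total_freq = Counter()
    for counts in doc_freq.values():
        total_freq.update(counts)

    keywords = {}
    for label, counts in doc_freq.items():
        n_in = n_docs[label]
        n_out = total_docs - n_in
        scores = {}
        for term, df_in in counts.items():
            df_out = total_freq[term] - df_in
            p_in = (df_in + 1) / (n_in + 2)
            p_out = (df_out + 1) / (n_out + 2)
            scores[term] = math.log(p_in / p_out)
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[label] = ranked[:top_k]
    return keywords


def label_site(text, keywords):
    """Stage 2 (assumed rule): assign the label whose keywords overlap most
    with the site's tokens; return None when no keyword matches."""
    tokens = set(tokenize(text))
    hits = {label: len(tokens & set(kws)) for label, kws in keywords.items()}
    best = max(sorted(hits), key=lambda label: hits[label])
    return best if hits[best] > 0 else None


# Toy, invented training data for two hypothetical categories.
training = [
    ("market", "buy drugs bitcoin escrow vendor"),
    ("market", "vendor escrow shipping bitcoin"),
    ("forum", "forum thread reply post bitcoin"),
    ("forum", "post reply moderator thread"),
]
learned = learn_keywords(training)
print(learned)
print(label_site("new vendor with escrow", learned))
```

In this sketch, a term such as "bitcoin" that appears in both categories scores lower than a term confined to one category, which is the sense in which the selected keywords are discriminative. A real system would likely replace the overlap vote with a trained classifier, as the abstract reports for ATOL.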
