Structure and Content of the Visible Darknet

In this paper, we analyze the topology and the content of the "darknet", the set of websites accessible via Tor. We built a darknet spider and crawled the darknet by recursively following links, starting from a bootstrap list. We explored the whole connected component of more than 34,000 hidden services, of which we found 10,000 to be online. Contrary to popular belief, the visible part of the darknet is surprisingly well-connected through hub websites such as wikis and forums. We performed a comprehensive categorization of the content using supervised machine learning. According to our classifier, about half of the visible darknet content relates to apparently licit activities. A significant amount of content pertains to software repositories, blogs, and activism-related websites. Among unlawful hidden services, most are fraudulent websites, services selling counterfeit goods, and drug markets.
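The crawl described above (start from a bootstrap list, follow links recursively, record which services respond) can be illustrated with a minimal breadth-first sketch. This is not the authors' spider: the function `crawl`, the callback `fetch_links`, and the toy link graph `TOY_GRAPH` with its `.onion` names are all hypothetical, and no network access is performed. A service is treated as offline when the callback returns `None`.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple


def crawl(
    bootstrap: Iterable[str],
    fetch_links: Callable[[str], Optional[List[str]]],
) -> Tuple[Set[str], Set[str]]:
    """Breadth-first crawl; returns (all discovered services, online services)."""
    seen: Set[str] = set(bootstrap)
    online: Set[str] = set()
    queue = deque(seen)
    while queue:
        site = queue.popleft()
        links = fetch_links(site)
        if links is None:
            # Hypothetical convention: None means the service did not respond.
            continue
        online.add(site)
        for target in links:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, online


# Hypothetical toy link graph standing in for real hidden services.
# "dead.onion" is linked to but has no entry, so it counts as offline.
TOY_GRAPH: Dict[str, List[str]] = {
    "wiki.onion": ["forum.onion", "shop.onion", "dead.onion"],
    "forum.onion": ["wiki.onion", "blog.onion"],
    "shop.onion": [],
    "blog.onion": ["wiki.onion"],
}

if __name__ == "__main__":
    discovered, responsive = crawl(["wiki.onion"], TOY_GRAPH.get)
    print(len(discovered), "discovered,", len(responsive), "online")
```

In this toy graph, the hub `wiki.onion` makes every other service reachable from a single bootstrap entry, which mirrors the role the abstract attributes to wikis and forums.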
