Protecting Web Contents Against Persistent Crawlers

Web crawlers have been developed for several malicious purposes, such as downloading server data without permission from the website administrator. Armored stealthy crawlers keep evolving to evade new anti-crawler mechanisms in the arms race between crawler developers and crawler defenders. In this thesis, we develop a new anti-crawler mechanism called PathMarker to detect and constrain crawlers that crawl server content stealthily and persistently. The basic idea is to add a marker to each web page URL and then encrypt the URL together with the marker. By using the URL path and the user information contained in the marker as new features for our detection modules, we can accurately detect stealthy crawlers, and even most distributed crawlers, at the earliest stage. Besides detecting crawlers effectively, PathMarker can also dramatically reduce the efficiency of crawlers before they are detected, by misleading them into visiting the same page repeatedly through URLs that carry different markers. We deploy our approach on a forum website to collect normal users' data. The evaluation results show that PathMarker can quickly capture all 12 open-source and in-house crawlers, plus two external crawlers (i.e., Googlebot and Yahoo Slurp).
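The marker idea can be illustrated with a minimal sketch. This is not PathMarker's actual implementation: the marker fields (the page the link appeared on and the user's ID), the function names mark_url and unmark_url, and the secret key are all assumptions made for illustration. The cipher is a standard-library stand-in (a SHA-256 keystream with an HMAC tag) used only to keep the example self-contained; a real deployment would use a vetted cipher such as AES.

```python
import base64
import hashlib
import hmac

# Hypothetical server-side secret; not taken from the thesis.
SECRET_KEY = b"demo-key-change-me"


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256 in counter mode (stand-in for a real cipher)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def mark_url(target_path: str, parent_path: str, user_id: str,
             key: bytes = SECRET_KEY) -> str:
    """Attach a marker (assumed here: parent page + user ID) and encrypt it with the URL."""
    plaintext = "\n".join([target_path, parent_path, user_id]).encode("utf-8")
    # Deterministic nonce derived from the plaintext: the same (page, marker)
    # always yields the same URL, while a different marker yields a different one.
    nonce = hmac.new(key, b"nonce" + plaintext, hashlib.sha256).digest()[:16]
    stream = _keystream(key, nonce, len(plaintext))
    cipher = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(key, nonce + cipher, hashlib.sha256).digest()
    return "/" + base64.urlsafe_b64encode(nonce + tag + cipher).decode("ascii")


def unmark_url(token: str, key: bytes = SECRET_KEY) -> tuple:
    """Decrypt a marked URL back into (target_path, parent_path, user_id)."""
    raw = base64.urlsafe_b64decode(token.lstrip("/"))
    nonce, tag, cipher = raw[:16], raw[16:48], raw[48:]
    expected = hmac.new(key, nonce + cipher, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("invalid or tampered URL")
    stream = _keystream(key, nonce, len(cipher))
    plaintext = bytes(a ^ b for a, b in zip(cipher, stream)).decode("utf-8")
    target_path, parent_path, user_id = plaintext.split("\n")
    return target_path, parent_path, user_id


# The same page linked from two different parent pages gets two different URLs.
url_a = mark_url("/thread/42", "/forum/1", "alice")
url_b = mark_url("/thread/42", "/forum/2", "alice")
print(url_a != url_b)
print(unmark_url(url_a))
```

Under these assumptions, a crawler that deduplicates by URL string treats url_a and url_b as distinct pages and fetches the same content twice, while the server can decrypt each request to recover the path by which the link was reached and the user it was issued to.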
