PUBCRAWL: Protecting Users and Businesses from CRAWLers

Web crawlers are automated tools that browse the web to retrieve and analyze information. Although crawlers are useful for helping users find content on the web, they can also be malicious. Unauthorized (malicious) crawlers are increasingly becoming a threat to service providers because they typically collect information that attackers can abuse for spamming, phishing, or targeted attacks. In particular, social networking sites are frequent targets of malicious crawling, and there have been recent cases of scraped data being sold on the black market and used for blackmail. In this paper, we introduce PUBCRAWL, a novel approach for the detection and containment of crawlers. Our detection is based on the observation that crawler traffic differs significantly from user traffic, even when many users are hidden behind a single proxy. Moreover, we present the first technique for crawler campaign attribution, which discovers synchronized traffic coming from multiple hosts. Finally, we introduce a containment strategy that leverages our detection results to block crawlers efficiently while minimizing the impact on legitimate users. Our experimental results on a large, well-known social networking site (receiving tens of millions of requests per day) demonstrate that PUBCRAWL can distinguish between crawlers and users with high accuracy. We have completed our technology transfer, and the social networking site is currently running PUBCRAWL in production.
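To make the detection intuition concrete, here is a minimal, hypothetical Python sketch of one way to separate steady, machine-driven request streams from bursty human browsing: compute the variance-to-mean ratio (index of dispersion) of per-window request counts and flag sources that are suspiciously regular. The feature choice, the 60-second window, and the 1.5 threshold are illustrative assumptions for this sketch, not the actual features or parameters used by PUBCRAWL.

```python
import numpy as np

def index_of_dispersion(counts: np.ndarray) -> float:
    """Variance-to-mean ratio of per-window request counts.

    Human browsing tends to be bursty (dispersion well above 1), while
    automated crawlers issue requests at a steady, machine-driven rate
    (dispersion near 0).
    """
    mean = counts.mean()
    return counts.var() / mean if mean > 0 else 0.0

def looks_like_crawler(timestamps, window_s=60.0, threshold=1.5):
    """Bucket request timestamps (seconds) into fixed windows and flag
    sources whose count series is suspiciously regular.

    window_s and threshold are illustrative assumptions, not values
    taken from the PUBCRAWL paper.
    """
    ts = np.sort(np.asarray(timestamps, dtype=float))
    buckets = np.floor((ts - ts[0]) / window_s).astype(int)
    counts = np.bincount(buckets)  # zero-filled for empty windows
    return index_of_dispersion(counts) < threshold

rng = np.random.default_rng(0)

# Steady "crawler": one request every two seconds for an hour.
crawler_ts = np.arange(0.0, 3600.0, 2.0)

# Bursty "user": short click sessions separated by long idle gaps.
starts = rng.exponential(300.0, 12).cumsum()
user_ts = np.concatenate([s + rng.uniform(0.0, 30.0, 15) for s in starts])

print(looks_like_crawler(crawler_ts))  # True  -- near-zero dispersion
print(looks_like_crawler(user_ts))     # False -- high variance-to-mean ratio
```

A single statistic like this would be easy to evade by adding jitter; it only illustrates why regularity-based traffic features can separate the two populations even behind a shared proxy, where per-user signals are unavailable.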
