Web robot detection based on pattern-matching technique

In web robot detection it is important is to find features that are common characteristics of diverse robots, in order to differentiate between them and humans. Existing approaches employ fairly simple features (e.g. empty referrer field, interval between successive requests), which often fail to reflect web robots’ behaviour accurately. False alarms may therefore occur unacceptably often. In this paper we propose a fresh approach that expresses the behaviour of interactive users and various web robots in terms of a sequence of request types, called request patterns. Previous proposals have primarily targeted the detection of text crawlers, but our approach works well on many other web robots, such as image crawlers, email collectors and link checkers. In empirical evaluation of more than 1 billion requests collected at www.microsoft.com, our approach achieved 94% accuracy in web robot detection, estimated by F-measure. A decision tree algorithm proposed by Tan and Kumar was also applied to the same data. A comparison shows that the proposed approach is more accurate, and that real-time detection of web robots is feasible.

[1]  Marios D. Dikaiakos,et al.  Web robot detection: A probabilistic reasoning approach , 2009, Comput. Networks.

[2]  Anália Lourenço,et al.  Catching web crawlers in the act , 2006, ICWE '06.

[3]  Maurizio Vichi New Developments in Classification and Data Analysis , 2005 .

[4]  M. HamidR.Jamali,et al.  Web robot detection in the scholarly information environment , 2008, J. Inf. Sci..

[5]  Swapna S. Gokhale,et al.  Web robot detection techniques: overview and limitations , 2010, Data Mining and Knowledge Discovery.

[6]  Xiaozhu Lin,et al.  An Automatic Scheme to Categorize User Sessions in Modern HTTP Traffic , 2008, IEEE GLOBECOM 2008 - 2008 IEEE Global Telecommunications Conference.

[7]  Lars Schmidt-Thieme,et al.  Web Robot Detection - Preprocessing Web Logfiles for Robot Detection , 2005 .

[8]  Dror G. Feitelson,et al.  Distinguishing humans from robots in web search logs: preliminary results using query rates and intervals , 2009, WSCD '09.

[9]  Virgílio A. F. Almeida,et al.  Analyzing Web Robots and Their Impact on Caching , 2001 .

[10]  Vipin Kumar,et al.  Discovery of Web Robot Sessions Based on their Navigational Patterns , 2004, Data Mining and Knowledge Discovery.

[11]  Hyungkyu Lee,et al.  Classification of web robots: An empirical study based on over one billion requests , 2009, Comput. Secur..

[12]  Shaozhi Ye,et al.  Workload-Aware Web Crawling and Server Workload Detection , 2004 .

[13]  Jan Vanthienen,et al.  Evaluation of Web Robot Discovery Techniques: A Benchmarking Study , 2006, ICDM.