Analyzing Web Robots and Their Impact on Caching

Understanding the nature and characteristics of Web robots is an essential step in analyzing their impact on caching. Using a multi-layer hierarchical workload model, this paper presents a characterization of the workload generated by autonomous agents and robots. The characterization focuses on the statistical properties of the arrival process and on the robot behavior graph model. We propose a set of criteria for identifying robots in real logs, and then identify and characterize robots in those logs using a multi-layered approach. Using a stack-distance-based analytical model of the interaction between robots and Web site caching, we assess the impact of robots' requests on Web caches. Our analyses show that robots cause a significant increase in the miss ratio of a server-side cache, because their referencing pattern completely disrupts locality assumptions. These results indicate not only the need for a better understanding of robot behavior, but also the need for Web caching policies that treat robots' requests differently from human-generated requests.
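
To make the stack distance idea concrete, the following is a minimal sketch and not the paper's analytical model. It assumes an LRU replacement policy (the abstract does not name one), and the request streams and the helper names `stack_distances` and `lru_miss_ratio` are hypothetical. A reference hits in an LRU cache of size C exactly when its stack distance is at most C, so interleaving a robot that touches many distinct objects once pushes the distances of human requests past the cache size.

```python
from typing import Hashable, List, Optional, Sequence


def stack_distances(refs: Sequence[Hashable]) -> List[Optional[int]]:
    """Return the LRU stack distance of each reference (None on first access)."""
    stack: List[Hashable] = []  # most recently used object is at index 0
    distances: List[Optional[int]] = []
    for obj in refs:
        if obj in stack:
            depth = stack.index(obj) + 1  # 1-based position in the LRU stack
            stack.remove(obj)
            distances.append(depth)
        else:
            distances.append(None)  # cold miss: object never seen before
        stack.insert(0, obj)  # referenced object moves to the top
    return distances


def lru_miss_ratio(refs: Sequence[Hashable], cache_size: int) -> float:
    """Miss ratio of an LRU cache that holds cache_size objects."""
    dists = stack_distances(refs)
    misses = sum(1 for d in dists if d is None or d > cache_size)
    return misses / len(dists)


# Hypothetical streams (assumptions for illustration only): humans revisit
# a few popular pages, while a robot requests many distinct pages once each.
human = ["a", "b", "a", "c", "a", "b", "a", "c"]
robot = [f"r{i}" for i in range(8)]
mixed = [x for pair in zip(human, robot) for x in pair]

human_only = lru_miss_ratio(human, cache_size=3)
with_robot = lru_miss_ratio(mixed, cache_size=3)
print(human_only)  # 0.375: only the three cold misses
print(with_robot)  # 1.0: every request misses once the robot is interleaved
```

In this toy case the robot's one-time references sit between consecutive human references to the same page, so every human re-reference has a stack distance greater than 3 and misses in the three-object cache.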
