BotSeer: An Automated Information System for Analyzing Web Robots

Robots.txt files are vital to the Web because they are intended to regulate what search engines can and cannot crawl. We present BotSeer, a Web-based information system and search tool that provides resources and services for researching Web robots and trends in the deployment of, and adherence to, the Robots Exclusion Protocol. BotSeer currently indexes and analyzes 2.2 million robots.txt files obtained from 13.2 million Websites, as well as a large Web server log of real-world robot behavior and related analyses. BotSeer provides three major services: robots.txt searching, robot bias analysis, and robot-generated log analysis. BotSeer serves as a resource for studying the regulation and behavior of Web robots, as well as a tool to inform the creation of effective robots.txt files and crawler implementations.
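To make the idea of robots.txt rules favoring some robots over others concrete, here is a minimal Python sketch. It is not BotSeer's implementation: it relies only on the standard library's `urllib.robotparser`, and the robots.txt content, the robot names, the URLs, and the helper `access_by_agent` are all hypothetical, chosen purely for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from BotSeer's data).
# It lets "googlebot" crawl everything except /private/ and blocks all other robots.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def access_by_agent(robots_txt, agents, url):
    """Return a dict mapping each user agent to whether it may fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# "examplebot" is a made-up robot name used to exercise the wildcard rule.
AGENTS = ["googlebot", "examplebot"]

home_access = access_by_agent(ROBOTS_TXT, AGENTS, "http://example.com/index.html")
private_access = access_by_agent(ROBOTS_TXT, AGENTS, "http://example.com/private/a.html")

print(home_access)
print(private_access)
```

Comparing the per-robot results across many sites is one simple way to see whether a given robot tends to be treated more or less favorably than others; any real bias measure would need a more careful definition than this per-URL check.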
