Topical web crawlers: Evaluating adaptive algorithms

Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links, with the goal of efficiently locating highly relevant target pages. We developed a framework to fairly evaluate topical crawling algorithms under a number of performance metrics. Here we use this framework to evaluate algorithms that have proven highly competitive among those proposed in the literature and in our own previous research. In particular, we focus on the tradeoff between exploration and exploitation of the cues available to a crawler, and on adaptive crawlers that use machine learning techniques to guide their search. We find that the best performance is achieved by a novel combination of explorative and exploitative bias, and we introduce an evolutionary crawler that surpasses the performance of the best nonadaptive crawler after sufficiently long crawls. We also analyze the computational complexity of the various crawlers and discuss how performance and complexity scale with available resources. Evolutionary crawlers achieve high efficiency and scalability by distributing the work across concurrent agents, resulting in the best performance/cost ratio.
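
The exploration/exploitation tradeoff mentioned above can be illustrated with a small sketch. It is not the crawlers evaluated in this work, only a toy best-first frontier under stated assumptions: `fetch_links`, `score`, and the batch size `n` are hypothetical names introduced here. With `n = 1` the crawler always follows the single highest-scoring link (more exploitative); a larger `n` commits to several links per step, including lower-scoring ones (more explorative).

```python
import heapq

def crawl_best_n_first(seeds, fetch_links, score, n, max_pages):
    """Toy crawl: visit up to max_pages URLs, taking the n best frontier links per step.

    fetch_links(url) -> iterable of out-link URLs (hypothetical callback).
    score(url)       -> estimated relevance, higher is better (hypothetical callback).
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-score(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        # Commit to a batch of the n best links before looking at new ones.
        batch = [heapq.heappop(frontier) for _ in range(min(n, len(frontier)))]
        for _, url in batch:
            if len(visited) >= max_pages:
                break
            visited.append(url)
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score(link), link))
    return visited
```

On a small hand-built link graph, `n = 1` dives along the most promising path first, while `n = 2` spreads each step over two links before descending further.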
