Evolving Strategies for Focused Web Crawling

The rapid growth of the World Wide Web has created many challenges for general-purpose crawlers, search engines, and web directories, making it difficult to find, index, and classify web pages by topic. Topic-driven crawlers can complement search engines because they pre-classify the pages retrieved by the crawl. Implementing such a focused crawler requires a strategy for ordering the crawl frontier. Such a strategy can use only information gleaned from previously crawled pages to estimate the relevance of a newly observed URL. Because the best strategy for ranking URLs in the crawl frontier is not immediately apparent, we discover strategies by evolving them with a genetic algorithm. Strategies are learned by evaluating the results of crawls simulated against a database generated by a previous, more general crawl. We conclude that a rank function that combines analysis of text and link structure yields effective strategies. The evolved strategies perform better than the commonly used Best First strategy.
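The approach can be illustrated with a minimal sketch, under assumptions that the abstract does not specify. The toy `PAGES` database, the two features (mean text similarity of already crawled parent pages, and the number of in-links seen so far), the linear rank function, and the genetic operators (tournament selection, uniform crossover, Gaussian mutation, elitism) are all hypothetical stand-ins chosen for brevity, not the rank functions or operators used in the paper. Fitness is the fraction of fetched pages that are relevant in a simulated crawl.

```python
import random

# Hypothetical toy database standing in for the results of an earlier, broader crawl.
# url -> (relevant?, topic similarity of the page text in [0, 1], outgoing links)
PAGES = {
    "seed": (True, 0.9, ["a", "b", "c"]),
    "a": (True, 0.8, ["d", "e"]),
    "b": (False, 0.1, ["f", "g"]),
    "c": (False, 0.2, ["d", "g"]),
    "d": (True, 0.7, ["h"]),
    "e": (True, 0.6, ["h", "i"]),
    "f": (False, 0.1, ["j"]),
    "g": (False, 0.3, ["j"]),
    "h": (True, 0.9, []),
    "i": (False, 0.2, []),
    "j": (False, 0.1, []),
}


def rank(url, crawled, weights):
    """Score a frontier URL using only information from already crawled pages."""
    parents = [p for p in crawled if url in PAGES[p][2]]
    text = sum(PAGES[p][1] for p in parents) / len(parents)  # mean parent similarity
    links = len(parents)  # in-links observed so far
    return weights[0] * text + weights[1] * links


def simulate_crawl(weights, budget=6):
    """Replay a crawl against PAGES; return the fraction of fetched pages that are relevant."""
    crawled, frontier = [], {"seed"}
    while frontier and len(crawled) < budget:
        if crawled:
            # sorted() makes tie-breaking deterministic
            url = max(sorted(frontier), key=lambda u: rank(u, crawled, weights))
        else:
            url = "seed"
        frontier.discard(url)
        crawled.append(url)
        frontier.update(u for u in PAGES[url][2] if u not in crawled)
    return sum(PAGES[u][0] for u in crawled) / len(crawled)


def evolve(generations=20, pop_size=10, seed=0):
    """Evolve rank-function weights; return the best weights and best fitness per generation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=simulate_crawl, reverse=True)
        history.append(simulate_crawl(pop[0]))
        children = [pop[0]]  # elitism: the current best survives unchanged
        while len(children) < pop_size:
            # tournament selection of two parents
            mum = max(rng.sample(pop, 2), key=simulate_crawl)
            dad = max(rng.sample(pop, 2), key=simulate_crawl)
            # uniform crossover followed by Gaussian mutation
            child = [rng.choice(pair) + rng.gauss(0, 0.1) for pair in zip(mum, dad)]
            children.append(child)
        pop = children
    best = max(pop, key=simulate_crawl)
    return best, history + [simulate_crawl(best)]


if __name__ == "__main__":
    best, history = evolve()
    print("text-only baseline fitness:", simulate_crawl([1.0, 0.0]))
    print("evolved weights:", best, "fitness:", history[-1])
```

Setting the weights to `[1.0, 0.0]` ranks frontier URLs by parent text similarity alone, which is one simple way to approximate a Best First style baseline in this toy setting. Because the elite individual is carried over unchanged and fitness is deterministic, the best fitness never decreases from one generation to the next.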
