Where to Crawl Next for Focused Crawlers

Because the WWW provides a large amount of data, retrieving interesting and useful information from it effectively and efficiently supports people's innovative and creative activities. In this paper, we propose a focused crawler for individual activities. We develop an algorithm that decides where a focused crawler should crawl next by integrating the concept of PageRank into that decision. We empirically evaluate our proposal in terms of precision and target recall. Some of the results show that our system achieves good target recall regardless of the topic on which the crawler focuses.
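To make the idea of a PageRank-informed crawl decision concrete, the following is a minimal sketch and not the algorithm proposed in the paper, whose details the abstract does not give. It assumes the crawler keeps the link graph discovered so far, computes PageRank on that partial graph, and picks the uncrawled URL with the highest blend of normalized PageRank and a topical relevance score. The names `choose_next`, `relevance`, and `weight`, and the linear blending rule itself, are assumptions made for illustration. The two evaluation helpers use the usual set-based definitions: precision as the fraction of crawled pages that are relevant, and target recall as the fraction of a known set of target pages that the crawl reached.

```python
from typing import Dict, List, Optional, Set


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over the link graph discovered so far."""
    nodes: Set[str] = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by pages with no known out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n
                    for u in nodes}
        for u, targets in graph.items():
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
        rank = new_rank
    return rank


def choose_next(graph: Dict[str, List[str]], crawled: Set[str],
                relevance: Dict[str, float],
                weight: float = 0.5) -> Optional[str]:
    """Return the uncrawled URL with the best blended score, or None.

    The blend (hypothetical): weight * normalized PageRank
    + (1 - weight) * topical relevance, with relevance in [0, 1].
    """
    frontier = {v for targets in graph.values() for v in targets} - crawled
    if not frontier:
        return None
    rank = pagerank(graph)
    max_rank = max(rank[u] for u in frontier)  # always > 0 for damping < 1

    def score(u: str) -> float:
        return weight * rank[u] / max_rank + (1.0 - weight) * relevance.get(u, 0.0)

    # Sorting first makes tie-breaking deterministic.
    return max(sorted(frontier), key=score)


def precision(crawled: Set[str], relevant: Set[str]) -> float:
    """Fraction of crawled pages that are relevant to the topic."""
    return len(crawled & relevant) / len(crawled) if crawled else 0.0


def target_recall(crawled: Set[str], targets: Set[str]) -> float:
    """Fraction of the known target pages that the crawl has reached."""
    return len(crawled & targets) / len(targets) if targets else 0.0
```

In this sketch, setting `weight` to 1.0 orders the frontier by link structure alone, while 0.0 reduces it to a purely relevance-driven best-first crawler. Recomputing PageRank on every decision is done only for brevity; a practical crawler would update the scores incrementally or in batches.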
