An Effectively Focused Crawling System

In this article, we describe the design and implementation of a focused crawling system that effectively collects webpages on specific topics. We develop an algorithm for deciding where to crawl next that exploits not only anchor texts but also the concept of PageRank. Given a target topic, the system crawls webpages that are expected to be both closely similar to the topic and highly ranked. We report and analyze experimental results for many topics.
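The idea of ordering the crawl by both topical similarity and rank can be illustrated with a small sketch. The abstract does not give the actual scoring formula, so everything below is an assumption made for illustration: the cosine similarity over anchor-text terms, the linear weight `alpha`, the `rank_estimate` input (standing in for a PageRank-like score normalized to the range 0 to 1), and the names `link_priority` and `Frontier` are all hypothetical and are not the authors' method.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Return lowercased bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_priority(topic, anchor_text, rank_estimate, alpha=0.5):
    """Assumed scoring rule: a weighted sum of anchor-text similarity to the
    topic and a rank estimate in [0, 1]. The paper's real rule may differ."""
    similarity = cosine(term_vector(topic), term_vector(anchor_text))
    return alpha * similarity + (1 - alpha) * rank_estimate


class Frontier:
    """Priority queue of URLs still to be crawled, highest priority first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        # Skip URLs that were already queued so each page is visited once.
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated priority.
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        """Remove and return the URL with the highest priority."""
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    topic = "focused web crawler"
    # (url, anchor text, rank estimate) triples; all values are made up.
    links = [
        ("http://example.org/a", "focused crawler tutorial", 0.2),
        ("http://example.org/b", "cooking recipes", 0.9),
        ("http://example.org/c", "web crawler", 0.8),
    ]
    frontier = Frontier()
    for url, anchor, rank in links:
        frontier.push(url, link_priority(topic, anchor, rank))
    while len(frontier):
        print(frontier.pop())
```

With an equal weighting, a link that is both on-topic and well ranked is crawled before one that is strong on only a single signal. Changing `alpha` shifts the balance between the two signals, which is the kind of trade-off a where-to-crawl-next policy has to settle.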
