On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis

Focused crawling is an important technique for topical resource discovery on the Web. The key issue in focused crawling is prioritizing the uncrawled uniform resource locators (URLs) in the frontier so that crawling concentrates on relevant pages. Traditional focused crawlers rely mainly on content analysis; link-based techniques, despite their usefulness, are not effectively exploited. In this paper, we propose a new frontier prioritization algorithm, the on-line topical importance estimation (OTIE) algorithm. OTIE combines link- and content-based analysis to evaluate the priority of an uncrawled URL in the frontier. We performed real crawling experiments over 30 topics selected from the Open Directory Project (ODP) and compared the harvest rate and target recall of four crawling algorithms: breadth-first, link-context-prediction, on-line page importance computation (OPIC), and our OTIE. Experimental results showed that OTIE significantly outperforms the other three algorithms on average target recall while maintaining an acceptable harvest rate. Moreover, OTIE is much faster than the traditional focused crawling algorithm.
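To make the idea of frontier prioritization concrete, the following is a minimal illustrative sketch and not the OTIE algorithm itself, whose details the abstract does not give. It assumes an OPIC-style scheme in which each crawled page passes its accumulated "cash" to its outlinks, and it scales that cash by a toy content-relevance score so that both link and content evidence influence priority. The names `Frontier`, `relevance`, `crawl`, and `WEB`, the keyword-overlap score, and the in-memory toy graph are all assumptions made for illustration. The two metric functions follow the definitions commonly used in the topical-crawling literature: harvest rate as the fraction of crawled pages that are relevant, and target recall as the fraction of a held-out target set that was crawled.

```python
from collections import defaultdict


def relevance(text, topic_terms):
    """Toy content score: fraction of topic terms present in the page text."""
    words = set(text.lower().split())
    return len(topic_terms & words) / len(topic_terms)


class Frontier:
    """Uncrawled URLs ranked by the cash they received from crawled pages."""

    def __init__(self, seeds):
        self.cash = defaultdict(float)
        self.crawled = set()
        for url in seeds:
            self.cash[url] = 1.0 / len(seeds)

    def __bool__(self):
        return bool(self.cash)

    def pop(self):
        # Highest cash first; ties go to the URL that entered the frontier earlier.
        url = max(self.cash, key=self.cash.get)
        amount = self.cash.pop(url)
        self.crawled.add(url)
        return url, amount

    def distribute(self, amount, score, outlinks):
        # Link evidence: cash is split equally over uncrawled outlinks.
        # Content evidence: the cash is first scaled by the page's relevance.
        targets = [u for u in outlinks if u not in self.crawled]
        if not targets:
            return
        share = amount * score / len(targets)
        for u in targets:
            self.cash[u] += share


def crawl(web, seeds, topic_terms, budget):
    """Crawl at most `budget` pages of an in-memory graph; return the visit order."""
    frontier = Frontier(seeds)
    order = []
    while frontier and len(order) < budget:
        url, amount = frontier.pop()
        order.append(url)
        text, outlinks = web.get(url, ("", []))
        frontier.distribute(amount, relevance(text, topic_terms), outlinks)
    return order


def harvest_rate(order, relevant):
    return sum(1 for u in order if u in relevant) / len(order)


def target_recall(order, targets):
    return len(set(order) & targets) / len(targets)


# Hypothetical toy graph: url -> (page text, outlinks).
WEB = {
    "seed": ("crawler topic", ["a", "b"]),
    "a": ("focused crawler topic page", ["c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("crawler frontier", []),
    "d": ("more recipes", []),
}

order = crawl(WEB, ["seed"], {"crawler", "topic"}, budget=4)
print(order)
print(harvest_rate(order, {"seed", "a", "c"}), target_recall(order, {"c", "d"}))
```

In this toy run the off-topic page `b` passes no cash to `d`, so `c`, reached through the on-topic page `a`, is crawled ahead of `d`. A real crawler would replace the keyword overlap with a trained classifier or similarity measure and would fetch pages over the network rather than from a dictionary.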
