A Novel Focused Crawler Based on Breadcrumb Navigation

In this paper, a novel focused crawler based on Breadcrumb Navigation (BN) is proposed. It leverages the Breadcrumb Navigation in webpages to reconstruct website structures and solve the focused crawling problem. Unlike previous focused crawlers that rely on prediction models, the BN crawler first samples the web to construct a semantic forest for websites based on their Breadcrumb Navigation, and then searches the forest for the sub-trees relevant to the given topic. After sampling, the BN crawler only needs to download the webpages belonging to the relevant sub-forest. With this method, the BN crawler spends less time analyzing Webpage-to-Topic (W2T) similarity while still crawling efficiently. The experimental results show that the BN crawler significantly outperforms Breadth-First and Best-First crawlers in harvest ratio and can be applied to most websites.
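The two phases described above (building a forest from sampled breadcrumb trails, then selecting topic-relevant sub-trees) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Node`, `build_forest`, `relevant_subtrees`, and `collect_urls`, the example URLs, and the keyword-overlap relevance test are all assumptions made for illustration, and the keyword test merely stands in for whatever similarity measure the crawler actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One breadcrumb label in the forest (hypothetical structure)."""
    label: str
    children: dict = field(default_factory=dict)  # label -> Node
    urls: list = field(default_factory=list)      # sampled pages ending here


def build_forest(samples):
    """Build a forest from (url, breadcrumb_trail) pairs.

    Each trail is a list of labels from the site root down to the page,
    e.g. ["Home", "Sports", "Football"]. Pages without a trail are skipped.
    """
    roots = {}
    for url, trail in samples:
        if not trail:
            continue
        level, node = roots, None
        for label in trail:
            node = level.setdefault(label, Node(label))
            level = node.children
        node.urls.append(url)
    return roots


def relevant_subtrees(roots, topic_terms):
    """Return the topmost nodes whose label shares a word with the topic.

    Assumption: plain keyword overlap stands in for a real relevance measure.
    """
    terms = {t.lower() for t in topic_terms}
    hits, stack = [], list(roots.values())
    while stack:
        node = stack.pop()
        if terms & set(node.label.lower().split()):
            hits.append(node)  # take the whole sub-tree, do not descend
        else:
            stack.extend(node.children.values())
    return hits


def collect_urls(node):
    """Gather every sampled URL in the sub-tree rooted at node."""
    urls = list(node.urls)
    for child in node.children.values():
        urls.extend(collect_urls(child))
    return urls


# Hypothetical sampled pages and their breadcrumb trails.
samples = [
    ("http://example.com/sports/football/1", ["Home", "Sports", "Football"]),
    ("http://example.com/sports/tennis/2", ["Home", "Sports", "Tennis"]),
    ("http://example.com/tech/ai/3", ["Home", "Tech", "AI"]),
]
forest = build_forest(samples)
hits = relevant_subtrees(forest, ["sports"])
```

In this sketch only the pages under the matched `Sports` node would be queued for download, which mirrors the idea of restricting the crawl to the relevant sub-forest after sampling.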
