Focused Crawling Using Vision-Based Page Segmentation

Focused crawling is the task of crawling the web to find pages relevant to a desired topic. In this paper we propose a focused crawling method based on the vision-based page segmentation (VIPS) algorithm. VIPS partitions a web page into groups of related content, called page blocks. The proposed method treats the text of a block as the link context of every link that the block contains. Link contexts are the terms that appear around a hyperlink within the text of a web page. Because VIPS uses visual cues in the segmentation process and is independent of the HTML structure of the page, it can identify link contexts accurately. Our empirical study shows that the proposed focused crawling method achieves higher performance than existing state-of-the-art methods.
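The block-as-link-context idea can be illustrated with a minimal sketch. It assumes the page has already been segmented: the `Block` class below is a hypothetical stand-in for the output of a VIPS-style segmenter, which is not implemented here. Scoring each context by cosine similarity against the topic terms is also an assumption made for illustration; the abstract does not say how link contexts are scored.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical page block: its visible text and the URLs it contains."""
    text: str
    links: list = field(default_factory=list)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_contexts(blocks):
    """Map each URL to the text of every block that contains it."""
    contexts = {}
    for block in blocks:
        for url in block.links:
            contexts[url] = (contexts.get(url, "") + " " + block.text).strip()
    return contexts


def rank_links(blocks, topic):
    """Return (url, score) pairs, highest-scoring link context first."""
    topic_vec = Counter(tokenize(topic))
    scored = [
        (url, cosine(Counter(tokenize(ctx)), topic_vec))
        for url, ctx in link_contexts(blocks).items()
    ]
    # Sort by descending score; break ties by URL for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

A crawler built along these lines would fetch the highest-ranked URLs first, for example by calling `rank_links(blocks, "football league")` on the blocks of each downloaded page and adding the results to its frontier.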
