Cluster-based page segmentation: a fast and precise method for web page pre-processing

Segmenting a web page is often one of the initial steps of information retrieval or content classification performed on that page. While there has been extensive research in this area, existing approaches usually focus either on performance or on the quality of the results. Vision-based segmentation is one of the quality-focused methods, and such methods are considerably slow. This paper proposes an approach for boosting the performance of vision-based algorithms. Our approach is based on concepts of the modern web and on a very common scenario in which an entire web site is processed at once. In this scenario, a substantial performance gain can be achieved by isomorphically mapping results previously gathered from pages within the site onto other pages of the same site. We provide the results of experiments performed on VIPS, the most common algorithm for page segmentation.
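The idea of reusing segmentation results across structurally similar pages of one site can be illustrated with a minimal sketch. It is not the method of the paper: it assumes that two pages can share a result only when their tag skeletons match exactly, which is a much cruder criterion than clustering with an isomorphic mapping. The names `structure_signature`, `CachedSegmenter` and `slow_segmenter` are hypothetical, and `slow_segmenter` stands in for any expensive vision-based segmenter such as VIPS.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _StructureParser(HTMLParser):
    """Collects the tag skeleton of a page, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.skeleton = []

    def handle_starttag(self, tag, attrs):
        self.skeleton.append(f"{self.depth}:{tag}")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def structure_signature(html):
    """Hash of the tag skeleton; equal for pages that differ only in content."""
    parser = _StructureParser()
    parser.feed(html)
    parser.close()
    joined = "|".join(parser.skeleton)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


class CachedSegmenter:
    """Runs the slow segmenter once per distinct page structure (assumption:
    the segmenter's result is expressed in structural terms, so it stays
    valid for every page with the same skeleton)."""

    def __init__(self, slow_segmenter):
        self.slow_segmenter = slow_segmenter
        self.cache = {}
        self.slow_calls = 0

    def segment(self, html):
        key = structure_signature(html)
        if key not in self.cache:
            self.cache[key] = self.slow_segmenter(html)
            self.slow_calls += 1
        return self.cache[key]
```

When a whole site is processed in one run, pages generated from the same template would hit the cache and skip the expensive segmentation step, for example `CachedSegmenter(my_vips_wrapper).segment(page_html)` with a hypothetical `my_vips_wrapper`. Pages whose structure differs even slightly miss the cache in this sketch, which is exactly the case a tolerant mapping between similar pages is meant to handle.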
