Indexing and querying segmented web pages: the BlockWeb Model

We present a model for indexing and querying web pages based on the hierarchical decomposition of pages into blocks. Splitting a page into blocks has several advantages for page design, indexing, and querying: (i) the blocks of a page most similar to a query can be returned instead of the whole page; (ii) the importance of each block can be taken into account; and (iii) so can the permeability of blocks to neighboring blocks, where a block b is said to be permeable to a block b′ in the same page if the content of b′ (text, image, etc.) can be (partially) inherited by b upon indexing. We describe an engine implementing this model, including the transformation of web pages into block hierarchies, the definition of a dedicated language for expressing indexing rules, and the storage of indexed blocks in an XML repository. The model is assessed on a dataset of electronic news and on a dataset drawn from web pages of the ImagEval campaign, where it improves on the baseline's mean average precision by 16%.
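To make the notion of permeability concrete, the following is a minimal sketch of how a block might inherit part of a neighbor's content at indexing time. It is an illustration under stated assumptions, not the engine described above: the function names, the bag-of-words term counts, the additive weighting, and the permeability value 0.5 are all hypothetical choices made for the example.

```python
from collections import Counter


def local_index(text):
    """Term-frequency vector built from a block's own text only."""
    return Counter(text.lower().split())


def permeable_index(block_id, texts, permeability):
    """Index a block from its own terms plus a weighted share of the
    terms of every block it is permeable to.

    Assumption: inheritance is modelled as a simple weighted sum, with
    permeability[b][b2] in [0, 1] giving the share of b2 inherited by b.
    """
    index = {t: float(c) for t, c in local_index(texts[block_id]).items()}
    for neighbour, alpha in permeability.get(block_id, {}).items():
        for term, count in local_index(texts[neighbour]).items():
            index[term] = index.get(term, 0.0) + alpha * count
    return index


# Hypothetical page: an image block with no text of its own, its caption,
# and an unrelated sidebar.
texts = {
    "image": "",
    "caption": "eiffel tower at night",
    "sidebar": "subscribe to our newsletter",
}

# The image block is permeable to its caption only (illustrative weight).
permeability = {"image": {"caption": 0.5}}

image_index = permeable_index("image", texts, permeability)
caption_index = permeable_index("caption", texts, permeability)
```

In this sketch the image block becomes retrievable through the caption's terms at half weight, while the sidebar contributes nothing because no permeability is declared toward it.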
