VIPS: a Vision-based Page Segmentation Algorithm

This paper proposes a new approach to web content structure analysis based on visual representation. Many web applications, such as information retrieval, information extraction, and automatic page adaptation, can benefit from this structure. We present an automatic, top-down, tag-tree-independent approach to detecting web content structure. It simulates how a user understands web layout structure based on visual perception. Compared with existing techniques such as DOM-tree-based analysis, our approach is independent of the underlying HTML document representation, so it works well even when the HTML structure differs substantially from the visual layout structure. Several experiments show the effectiveness of our method.
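To make the idea of top-down, layout-driven segmentation concrete, here is a minimal sketch. It is not the VIPS algorithm itself, whose rules are not given in this abstract. It is a simplified recursive cut on whitespace gaps, in the spirit of the classic XY-cut from document analysis. It assumes the rendered rectangles of leaf elements are already available. The names Box, Block, split_by_gaps, segment, and the min_gap threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """Rendered rectangle of a leaf element (hypothetical input format)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Block:
    """A node in the visual content structure: a group of boxes and its sub-blocks."""
    boxes: List[Box]
    children: List["Block"] = field(default_factory=list)


def split_by_gaps(boxes: List[Box], axis: str, min_gap: float) -> List[List[Box]]:
    """Group boxes separated by empty strips of at least min_gap along axis ('x' or 'y')."""
    lo = (lambda b: b.x0) if axis == "x" else (lambda b: b.y0)
    hi = (lambda b: b.x1) if axis == "x" else (lambda b: b.y1)
    ordered = sorted(boxes, key=lo)
    groups = [[ordered[0]]]
    reach = hi(ordered[0])  # farthest extent covered by the boxes seen so far
    for b in ordered[1:]:
        if lo(b) - reach >= min_gap:
            groups.append([b])      # a wide enough gap acts as a visual separator
        else:
            groups[-1].append(b)
        reach = max(reach, hi(b))
    return groups


def segment(boxes: List[Box], min_gap: float = 10.0) -> Block:
    """Recursively split a region top-down, using only geometry (no tag tree)."""
    block = Block(boxes)
    if len(boxes) < 2:
        return block
    # Try horizontal separators first (split along y), then vertical ones (split along x).
    for axis in ("y", "x"):
        groups = split_by_gaps(boxes, axis, min_gap)
        if len(groups) > 1:
            block.children = [segment(g, min_gap) for g in groups]
            break
    return block


# Hypothetical page: a header above a navigation column and a main content area.
page = [Box(0, 0, 800, 50), Box(0, 70, 150, 600), Box(170, 70, 800, 600)]
tree = segment(page)
```

Because the sketch looks only at rendered positions, two elements that are far apart in the HTML source but adjacent on screen end up in the same block, which is the property the abstract emphasizes. A fuller treatment would also need visual cues such as background color and font, which this sketch ignores.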
