Semantic HTML Page Segmentation using Type Analysis

Semantic information is necessary for Semantic Web processing and is useful to Web adaptation services, such as personalizing users' browsing on small-screen devices. However, in most existing HTML documents semantic information is encoded only implicitly. This paper describes a page segmentation method that parses Web pages into blocks, that is, rectangular segments carrying some semantic information. Existing page segmentation techniques are mainly built on the HTML DOM structure or are purely vision-based, and are not accurate enough in either visual presentation or semantic sense. Our approach is automatic and is based on a refined type system that tightly couples type analysis with indispensable visual cues to generate blocks arranged in a tree structure, aiming for a high degree of coherence in both the semantic and visual views. Experimental results show that our method achieves better accuracy and completeness than existing ones.
