Paper information - A Web Page Segmentation Method based on Page Layouts and Title Blocks

Summary: In this work, we describe a new Web page segmentation method for extracting the semantic structure of a Web page. A typical Web page consists of multiple elements with different functions, such as main content, navigation panels, copyright and privacy notices, and advertisements. Web page segmentation divides the page into visually and semantically cohesive pieces. The proposed method comprises three steps. First, it determines the layout template of a Web page by template matching. Second, it divides the page into minimum blocks. Third, it assembles groups of these blocks into Web content blocks. While minimum blocks can play many roles, in this study we focus on those that are the titles of various pieces of Web content. We used decision tree learning with nine parameters for each minimum block to extract the title blocks from Web pages. Experimental results showed that the decision tree generated by the J48 algorithm is the most suitable for this type of extraction.
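
To make the decision tree step more concrete, the sketch below shows how a single tree node can choose a threshold split by information gain. J48 is Weka's implementation of C4.5, which ranks splits by gain ratio; plain information gain is used here for brevity, so this is not the authors' implementation. The summary does not list the nine parameters, so the two features (`font_size_px`, `text_length_chars`) and the toy data are hypothetical stand-ins.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature_index, threshold) of the binary split
    with the highest information gain over all numeric features."""
    base = entropy(labels)
    best = (0.0, None, None)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct
        # values, so both sides of every split are non-empty.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            remainder = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(labels)
            gain = base - remainder
            if gain > best[0]:
                best = (gain, f, t)
    return best


# Hypothetical toy data: (font_size_px, text_length_chars) per minimum block.
blocks = [(24, 18), (22, 25), (20, 30), (12, 400), (12, 250), (11, 20), (13, 600)]
is_title = [True, True, True, False, False, False, False]

gain, feature, threshold = best_split(blocks, is_title)
print(f"split on feature {feature} at {threshold} (gain {gain:.3f} bits)")
```

On this toy data the font-size feature separates titles from other blocks cleanly, while text length does not, because one short non-title block overlaps with the titles. A full tree learner would apply the same selection recursively to each side of the split and then prune.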
