Using XPath to Discover Informative Content Blocks of Web Pages

Web pages usually contain a variety of content, some relevant and some irrelevant to the page's main topic. We call the relevant content informative content blocks and the irrelevant content clutter. Clutter can mislead search engines or trigger an artificially high link-based ranking for specific target pages, so cleaning Web pages before mining is critical to the performance of traditional information retrieval. Here, we propose an unsupervised method for discovering informative content blocks. First, using a set of sample pages, we apply a series of rules to distinguish informative content blocks from clutter. We then generalize a common XPath for the informative content blocks or the clutter and apply it to similar pages. We applied our method to five different Web sites and produced simpler, more focused HTML files. Experimental results show that the method extracts the informative content blocks of Web pages accurately. A further advantage of the approach is that it is completely automatic.
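
The generalize-then-apply step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the rules or the generalization procedure, so the sketch assumes the informative blocks on the sample pages have already been located as absolute XPaths, and it assumes a simple merge policy in which a positional index is kept only when all samples agree on it. The function names `generalize_xpath` and `extract_blocks` are hypothetical, and the page is assumed to be well-formed XHTML so that the standard library parser can read it.

```python
import re
import xml.etree.ElementTree as ET

# One XPath step such as "div" or "div[2]" (assumed form of the sample paths).
STEP = re.compile(r"^([\w:-]+)(?:\[(\d+)\])?$")


def generalize_xpath(paths):
    """Merge absolute XPaths step by step (assumed policy, not the paper's).

    A positional index survives only if every sample path has the same
    index at that step; otherwise the step matches all siblings with that tag.
    """
    split = [p.strip("/").split("/") for p in paths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("sample paths differ in depth")
    merged = []
    for steps in zip(*split):
        parsed = []
        for step in steps:
            match = STEP.match(step)
            if match is None:
                raise ValueError(f"unsupported step: {step!r}")
            parsed.append(match.groups())
        tags = {tag for tag, _ in parsed}
        if len(tags) != 1:
            raise ValueError("sample paths differ in tag names")
        indexes = {index for _, index in parsed}
        tag = tags.pop()
        index = indexes.pop() if len(indexes) == 1 else None
        merged.append(f"{tag}[{index}]" if index else tag)
    return "/" + "/".join(merged)


def extract_blocks(page_source, xpath):
    """Return the elements of a similar page selected by the generalized XPath."""
    root = ET.fromstring(page_source)  # requires well-formed markup
    steps = xpath.strip("/").split("/")
    if STEP.match(steps[0]).group(1) != root.tag:
        return []
    if len(steps) == 1:
        return [root]
    # ElementTree evaluates paths relative to the root element.
    return root.findall("./" + "/".join(steps[1:]))


# XPaths of informative blocks found on two hypothetical sample pages.
samples = ["/html/body/div[2]/div[1]", "/html/body/div[2]/div[3]"]
common = generalize_xpath(samples)

# A hypothetical page from the same site, sharing the template.
page = (
    "<html><body>"
    "<div>navigation</div>"
    "<div><div>story one</div><div>story two</div></div>"
    "</body></html>"
)
blocks = [element.text for element in extract_blocks(page, common)]
```

Here the two sample paths differ only in the last index, so the merged path drops it and selects every `div` inside the second `div` of `body`, leaving the navigation block out. A real crawler would need an HTML-tolerant parser, since `xml.etree.ElementTree` rejects markup that is not well formed.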
