Identifying Content Blocks from Web Documents

Intelligent information processing systems, such as digital libraries and search engines, index web pages according to their informative content. However, web pages also contain a good deal of non-informative content, e.g., navigation sidebars, advertisements, and copyright notices. It is therefore important to separate the informative “primary content blocks” from these non-informative blocks. In this paper, we propose two algorithms, FeatureExtractor and K-FeatureExtractor, that identify the “primary content blocks” based on their features. Neither algorithm requires supervised learning, yet both identify the “primary content blocks” with high precision and recall. On several thousand web pages obtained from 15 different websites, our algorithms significantly outperform the entropy-based algorithm proposed by Lin and Ho [14] in both precision and run-time.

[1] Maarten de Rijke, et al. Wrapper Generation via Grammar Induction, 2000, ECML.

[2] Nicholas Kushmerick, et al. Wrapper Induction for Information Extraction, 1997, IJCAI.

[3] Xiaoli Li, et al. Eliminating noisy information in Web pages for data mining, 2003, KDD '03.

[4] Craig A. Knoblock, et al. Ariadne: a system for constructing mediators for Internet sources, 1998, SIGMOD '98.

[5] Paolo Merialdo, et al. Semistructured and structured data in the Web: going back and forth, 1997, SGMD.

[6] Bing Liu, et al. Visualizing web site comparisons, 2002, WWW '02.

[7] Jennifer Widom, et al. The TSIMMIS Project: Integration of Heterogeneous Information Sources, 1994, IPSJ.

[8] Nicholas Kushmerick, et al. Wrapper induction: Efficiency and expressiveness, 2000, Artif. Intell.

[9] Robert L. Grossman, et al. Mining data records in Web pages, 2003, KDD '03.

[10] Divesh Srivastava, et al. Data model and query evaluation in global information systems, 1995, Journal of Intelligent Information Systems.

[11] Jan-Ming Ho, et al. Discovering informative content blocks from Web documents, 2002, KDD.

[12] Chun-Nan Hsu. Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules, 1998.

[13] William W. Cohen. A Web-based information system that reasons with structured collections of text, 1998, AGENTS '98.

[14] Valter Crescenzi, et al. RoadRunner: Towards Automatic Data Extraction from Large Web Sites, 2001, VLDB.

[15] Sandip Debnath, et al. Automatic extraction of informative blocks from webpages, 2005, SAC '05.

[16] Ziv Bar-Yossef, et al. Template detection via data mining and its applications, 2002, WWW.

[17] Divesh Srivastava, et al. The Information Manifold, 1995.

[18] Craig A. Knoblock, et al. Hierarchical Wrapper Induction for Semistructured Information Sources, 2004, Autonomous Agents and Multi-Agent Systems.

[19] Enric Plaza, et al. Machine Learning: ECML 2000, 2003, Lecture Notes in Computer Science.