Extracting the semantic content of web pages via repeated structures

Web pages carry semantics that matter to both their authors and their readers. For various reasons, authors also place content that is irrelevant to a page's underlying semantics, such as advertisements, navigation bars, and links, at different positions on the page. As a result, modeling a page's semantics from all of its data may give poor results. In this paper, we propose a framework that extracts the real semantic content from web pages via repeated structures in the HTML data. Our algorithm first detects the real semantic blocks in a web page via repeated structure segmentation, then extracts the page's real semantic content from those blocks.
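The two-stage idea (segment by repeated structure, then extract text from the detected blocks) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not specify how repetition is measured or how real semantic blocks are told apart from repeated noise such as link lists. The sketch assumes that a "repeated structure" is a run of consecutive sibling elements with the same tag-only shape, and it uses "the run with the most text" as a stand-in selection rule. All names (`Node`, `TreeBuilder`, `signature`, `repeated_blocks`, `extract_content`, `main_content`, `min_repeats`) are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def signature(node):
    """Tag-only shape of a subtree, e.g. 'li(a)'. Assumed similarity measure."""
    if not node.children:
        return node.tag
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def all_text(node):
    """Text of a subtree (own text first, then children; ordering is simplified)."""
    parts = list(node.text) + [all_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def repeated_blocks(node, min_repeats=3):
    """Yield runs of at least min_repeats consecutive siblings with equal signatures."""
    run = []
    for child in node.children:
        if run and signature(child) == signature(run[0]):
            run.append(child)
        else:
            if len(run) >= min_repeats:
                yield run
            run = [child]
    if len(run) >= min_repeats:
        yield run
    for child in node.children:
        yield from repeated_blocks(child, min_repeats)


def extract_content(html, min_repeats=3):
    """Return the text of every repeated-structure run found in the page."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [[all_text(n) for n in run]
            for run in repeated_blocks(builder.root, min_repeats)]


def main_content(html, min_repeats=3):
    """Assumed heuristic: keep the run carrying the most text."""
    runs = extract_content(html, min_repeats)
    return max(runs, key=lambda r: sum(len(t) for t in r)) if runs else []
```

Calling `extract_content(page_html)` returns one list of strings per detected run, and `main_content(page_html)` keeps only the most text-heavy one. Exact signature matching is deliberately strict; a tolerant tree-similarity measure would be needed for real pages, where repeated records rarely share an identical shape.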
