Extracting general lists from web documents: a hybrid approach

The problem of extracting structured data (e.g. lists, record sets, and tables) from the Web has traditionally been approached by considering either the underlying markup structure of a Web page or its visual structure. However, empirical results show that approaches relying on the HTML structure or the visual cues of a Web page in isolation do not generalize well. We propose a new hybrid method to extract general lists from the Web. It employs general assumptions on both the visual rendering of lists and the structural representation of the items they contain. We show that our method significantly outperforms existing methods across a varied Web corpus.
