Paper information - Unexpected results in automatic list extraction on the web

The discovery and extraction of general lists on the Web continues to be an important problem facing the Web mining community. Numerous studies claim to automatically extract structured data (e.g., lists, record sets, and tables) from the Web for various purposes. Our own recent experience has shown that the list-finding methods used as part of these larger frameworks do not generalize well and therefore ought to be reevaluated. This paper briefly describes some of the current approaches and tests them on various list pages. Based on our findings, we conclude that analyzing a Web page's DOM structure is not sufficient for the general list-finding task.

Donato Malerba | Jiawei Han | Tim Weninger | Fabio Fumarola | Rick Barber