Automatic Extraction of Structured Web Data with Domain Knowledge

We present a novel approach for extracting structured data from the Web, whose goal is to harvest real-world items from template-based HTML pages (the structured Web). The approach illustrates a two-phase querying of the Web: first, an intentional description of the targeted data is provided, in a flexible and widely applicable manner; then, the extraction process leverages both this input description and the source structure. Our approach is domain-independent, in the sense that it applies to any relation, either flat or nested, describing real-world items. Extensive experiments on five different domains, together with a comparison against the main state-of-the-art extraction systems from the literature, illustrate its flexibility and precision. With this technique, we argue that automatic extraction and integration of complex structured data can be done quickly and effectively when the redundancy of the Web meets knowledge about the data to be extracted.
