Extraction and Integration of Partially Overlapping Web Sources

We present an unsupervised approach for harvesting the data exposed by a set of structured, partially overlapping, data-intensive web sources. Our proposal is set within a formal framework that tackles two problems: data extraction, which generates extraction rules from the input websites, and data integration, which merges the extracted data into a unified schema. We introduce an original algorithm, WEIR, to solve both problems and formally prove its correctness. WEIR leverages the data that overlap across sources to make better decisions both in data extraction (by pruning rules that do not lead to redundant information) and in data integration (by reflecting local properties of a source over the mediated schema). Along the way, we characterize the amount of redundancy our algorithm needs to produce a solution, and we present experimental results showing the benefits of our approach over existing solutions.
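
The idea of pruning extraction rules that do not lead to redundant information can be illustrated with a minimal sketch. This is not the WEIR algorithm itself, only a toy instance of redundancy-based pruning under strong assumptions: entities are already aligned across sources by a shared key, values match exactly, and the names `overlap`, `prune_rules`, `threshold`, and the sample sites are all hypothetical.

```python
from itertools import combinations

def overlap(a, b):
    """Fraction of shared entities on which two extracted columns agree."""
    shared = [k for k in a if k in b]
    if not shared:
        return 0.0
    return sum(a[k] == b[k] for k in shared) / len(shared)

def prune_rules(sources, threshold=0.5):
    """Keep only rules whose output is matched by a rule of another source.

    `sources` maps a source name to {rule name: {entity key: value}}.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    kept = {name: set() for name in sources}
    for (s1, rules1), (s2, rules2) in combinations(sources.items(), 2):
        for r1, col1 in rules1.items():
            for r2, col2 in rules2.items():
                if overlap(col1, col2) >= threshold:
                    kept[s1].add(r1)
                    kept[s2].add(r2)
    return kept

# Hypothetical output of candidate rules on two overlapping sources.
sources = {
    "siteA": {
        "price": {"AAPL": "150.0", "MSFT": "300.0", "IBM": "130.0"},
        "banner": {"AAPL": "Buy now!", "MSFT": "Buy now!", "IBM": "Buy now!"},
    },
    "siteB": {
        "last": {"AAPL": "150.0", "MSFT": "300.0", "GOOG": "2800.0"},
        "footer": {"AAPL": "(c) siteB", "MSFT": "(c) siteB", "GOOG": "(c) siteB"},
    },
}
kept = prune_rules(sources)
```

In this toy input, only the `price` rule of `siteA` and the `last` rule of `siteB` survive, because their values coincide on the entities the two sources share; rules that extract site-specific noise find no counterpart and are discarded. The surviving pair also suggests a correspondence between the two attributes, which hints at how overlap can serve integration as well as extraction.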

[29]  Valter Crescenzi, et al.  Extraction and integration of partially overlapping web sources , 2013, VLDB 2013.
