Bootstrapping pay-as-you-go data integration systems

Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort of creating a mediated schema and semantic mappings from the data sources to the mediated schema. Many application contexts involving multiple data sources (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services, motivating a pay-as-you-go approach to integration. With that approach, a system starts with very few (or inaccurate) semantic mappings and these mappings are improved over time as deemed necessary. This paper describes the first completely self-configuring data integration system. The goal of our work is to investigate how advanced of a starting point we can provide a pay-as-you-go system. Our system is based on the new concept of a probabilistic mediated schema that is automatically created from the data sources. We automatically create probabilistic schema mappings between the sources and the mediated schema. We describe experiments in multiple domains, including 50-800 data sources, and show that our system is able to produce high-quality answers with no human intervention.

[1]  Kevin Chen-Chuan Chang,et al.  Statistical schema matching across web query interfaces , 2003, SIGMOD '03.

[2]  Erhard Rahm,et al.  A survey of approaches to automatic schema matching , 2001, The VLDB Journal.

[3]  Wei-Ying Ma,et al.  Instance-based Schema Matching for Web Databases by Domain-specific Query Probing , 2004, VLDB.

[4]  Anthony Kosky,et al.  Theoretical Aspects of Schema Merging , 1992, EDBT.

[5]  Renée J. Miller,et al.  The Use of Information Capacity in Schema Integration and Translation , 1993, VLDB.

[6]  Erhard Rahm,et al.  Similarity flooding: a versatile graph matching algorithm and its application to schema matching , 2002, Proceedings 18th International Conference on Data Engineering.

[7]  Jinling Song,et al.  Discovering Complex Semantic Matches between Database Schemas , 2009, 2009 International Conference on Web Information Systems and Mining.

[8]  Pradeep Ravikumar,et al.  A Comparison of String Distance Metrics for Name-Matching Tasks , 2003, IIWeb.

[9]  Pedro M. Domingos,et al.  Learning to map between ontologies on the semantic web , 2002, WWW '02.

[10]  Erhard Rahm,et al.  COMA - A System for Flexible Combination of Schema Matching Approaches , 2002, VLDB.

[11]  Alon Y. Halevy,et al.  Data integration with uncertainty , 2007, The VLDB Journal.

[12]  Felix Schlenk,et al.  Proof of Theorem 3 , 2005 .

[13]  Leonid A. Kalinichenko,et al.  Methods and Tools for Equivalent Data Model Mapping Construction , 1990, EDBT.

[14]  Miroslav Dudík,et al.  Performance Guarantees for Regularized Maximum Entropy Density Estimation , 2004, COLT.

[15]  Adam L. Berger,et al.  A Maximum Entropy Approach to Natural Language Processing , 1996, CL.

[16]  Jeffrey F. Naughton,et al.  On schema matching with opaque column names and data values , 2003, SIGMOD '03.

[17]  Avigdor Gal,et al.  Why is schema matching tough and what can we do about it? , 2006, SGMD.

[18]  Umberto Straccia,et al.  Information retrieval and machine learning for probabilistic schema matching , 2005, CIKM '05.

[19]  Richard S. Varga,et al.  Proof of Theorem 5 , 1983 .

[20]  Richard Hull,et al.  Relative information capacity of simple relational database schemata , 1984, SIAM J. Comput..

[21]  David Maier,et al.  From databases to dataspaces: a new abstraction for information management , 2005, SGMD.

[22]  Alon Y. Halevy,et al.  Pay-as-you-go user feedback for dataspace systems , 2008, SIGMOD Conference.

[23]  Pedro M. Domingos,et al.  iMAP: discovering complex semantic matches between database schemas , 2004, SIGMOD '04.

[24]  Matteo Magnani,et al.  Schema Integration Based on Uncertain Semantic Mappings , 2005, ER.

[25]  Amihai Motro,et al.  Database Schema Matching Using Machine Learning with Feature Selection , 2002, CAiSE.

[26]  John D. Lafferty,et al.  Inducing Features of Random Fields , 1995, IEEE Trans. Pattern Anal. Mach. Intell..