An Uncertain Data Integration System

Data integration systems offer uniform access to a set of autonomous and heterogeneous data sources. An important task in setting up a data integration system is matching the attributes of the source schemas. In this paper, we propose a data integration system that uses the knowledge implied by functional dependencies to match the source schemas. We build our system on a probabilistic data model to capture the uncertainty that arises during the matching process. Our performance evaluation confirms that both functional dependencies and a probabilistic data model improve the quality of schema matching. Our experimental results show a significant performance gain over the baseline approaches. They also show that our system scales well.
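
To make the idea of matching attributes with functional dependencies under uncertainty more concrete, the following is a minimal sketch and not the system described in this paper. It assumes that each source schema comes with a list of functional dependencies, scores attribute pairs by name similarity, adds a bonus when both attributes act as determinants in their own schema, and normalizes the scores into a probability distribution over candidate matches. The function names, the `fd_bonus` parameter, and the example schemas are all hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity in [0, 1] between two attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def determinants(fds):
    """Attributes that appear on the left-hand side of some dependency."""
    return {attr for lhs, _rhs in fds for attr in lhs}


def match_probabilities(attrs1, fds1, attrs2, fds2, fd_bonus=0.5):
    """For each attribute of schema 1, return a probability distribution
    over the attributes of schema 2 (assumed heuristic, for illustration)."""
    det1, det2 = determinants(fds1), determinants(fds2)
    result = {}
    for a in attrs1:
        scores = {}
        for b in attrs2:
            score = name_similarity(a, b)
            # Assumed heuristic: attributes that both determine other
            # attributes are more likely to correspond to each other.
            if a in det1 and b in det2:
                score += fd_bonus
            scores[b] = score
        total = sum(scores.values())
        # Fall back to a uniform distribution if every score is zero.
        result[a] = {
            b: (s / total if total > 0 else 1.0 / len(attrs2))
            for b, s in scores.items()
        }
    return result


# Hypothetical source schemas; each dependency is (determinant tuple, dependent).
attrs1 = ["emp_id", "name", "dept"]
fds1 = [(("emp_id",), "name"), (("emp_id",), "dept")]
attrs2 = ["employee_id", "full_name", "department"]
fds2 = [(("employee_id",), "full_name"), (("employee_id",), "department")]

probs = match_probabilities(attrs1, fds1, attrs2, fds2)
best_for_emp_id = max(probs["emp_id"], key=probs["emp_id"].get)
```

Keeping a full distribution for each attribute, rather than committing to a single best match, is what allows uncertainty in the matching to be carried forward to later stages such as query answering.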
