Accurate Schema Matching on Streams

We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. However, computing these similarities exactly requires processing every database record, which is infeasible for data streams. We devise a fast matching algorithm that uses only a small sample of records, yet is guaranteed to match the most similar attributes with high probability. The method applies to any similarity metric, or combination of metrics, that can be estimated from a sample with bounded error; we apply the algorithm to several such metrics. We give a rigorous proof of the method's correctness and report on experiments using large databases.
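The idea of matching from a sample with a probabilistic guarantee can be illustrated with a small sketch. This is not the paper's algorithm: it is a generic fixed-size sampling scheme based on Hoeffding's inequality and a union bound, and it assumes each similarity is the mean of per-record scores bounded in [0, 1]. The names `hoeffding_radius`, `samples_needed`, `best_match`, and `draw_score`, and the candidate attribute pairs in the demo, are hypothetical.

```python
import math
import random


def hoeffding_radius(n, delta):
    """Half-width of a two-sided Hoeffding interval for the mean of
    n independent samples in [0, 1], holding with probability >= 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def samples_needed(epsilon, delta, num_candidates):
    """Sample size at which every candidate's estimate is within epsilon / 2
    of its true similarity, simultaneously, with probability >= 1 - delta."""
    per_candidate_delta = delta / num_candidates  # union bound over candidates
    return math.ceil(2.0 * math.log(2.0 / per_candidate_delta) / epsilon ** 2)


def best_match(draw_score, candidates, epsilon=0.1, delta=0.05, seed=0):
    """Estimate each candidate's similarity from a sample and return the best.

    draw_score(candidate, rng) is an assumed callback that returns one
    per-record similarity score in [0, 1]. With probability >= 1 - delta the
    returned candidate's true similarity is within epsilon of the maximum.
    """
    rng = random.Random(seed)
    n = samples_needed(epsilon, delta, len(candidates))
    estimates = {}
    for cand in candidates:
        total = sum(draw_score(cand, rng) for _ in range(n))
        estimates[cand] = total / n
    best = max(estimates, key=estimates.get)
    return best, estimates, n


# Illustrative data only: hidden "true" similarities for three attribute pairs.
TRUE_SIMILARITY = {
    ("name", "full_name"): 0.9,
    ("name", "city"): 0.5,
    ("name", "zip"): 0.3,
}


def draw_bernoulli_score(cand, rng):
    """Simulate one sampled record: score 1.0 with the pair's true similarity."""
    return 1.0 if rng.random() < TRUE_SIMILARITY[cand] else 0.0


if __name__ == "__main__":
    best, estimates, n = best_match(draw_bernoulli_score, list(TRUE_SIMILARITY))
    print("samples per candidate:", n)
    print("best match:", best)
```

The sample size depends only on the error tolerance, the confidence level, and the number of candidate pairs, not on the number of records, which is what makes this style of estimate usable on a stream. A sequential variant that stops as soon as the confidence intervals separate would typically need fewer samples.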
