Schema Matching Using Interattribute Dependencies

Schema matching is one of the key challenges in information integration. It is a labor-intensive and time-consuming process, and many automated solutions have been proposed to reduce the burden. Most existing solutions rely mainly on textual similarity of the data to be matched. However, there are instances of the schema matching problem to which they do not apply at all. Such instances typically arise when the column names in the schemas and the data in the columns are opaque or very difficult to interpret. In our previous work [36] we proposed a two-step technique to address this problem. In the first step, we measure the dependencies between attributes within each table using an information-theoretic measure and construct a dependency graph for each table that captures these dependencies. In the second step, we find matching node pairs across the dependency graphs by running a graph matching algorithm. That work experimentally validated the accuracy of the approach. One remaining challenge is the computational complexity of the graph matching problem in the second step. In this paper we extend the previous work by improving the second step, incorporating efficient approximation algorithms into the framework.
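The two-step idea can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: mutual information is taken as the information-theoretic measure, the dependency graph is stored as a dense matrix whose diagonal holds each attribute's entropy, and the graph matching step is an exhaustive search over one-to-one mappings that minimizes the squared difference between the two matrices. The function names and toy tables are hypothetical, and the exhaustive search stands in for, rather than reproduces, the approximation algorithms the paper refers to.

```python
import math
from collections import Counter
from itertools import permutations


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y); for a column with itself this is H(X)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def dependency_graph(table):
    """Step 1 (assumed form): matrix of pairwise mutual information.

    `table` is a list of columns, each a list of values of equal length.
    """
    n = len(table)
    return [[mutual_information(table[i], table[j]) for j in range(n)]
            for i in range(n)]


def match_graphs(g1, g2):
    """Step 2 (assumed form): exhaustive one-to-one matching.

    Returns a tuple `perm` where column i of the first table maps to
    column perm[i] of the second. The cost is factorial in the number of
    columns, which is why approximations matter for larger schemas.
    """
    n = len(g1)

    def cost(perm):
        return sum((g1[i][j] - g2[perm[i]][perm[j]]) ** 2
                   for i in range(n) for j in range(n))

    return min(permutations(range(n)), key=cost)


# Toy tables with opaque values: table_b holds the same data as table_a,
# with columns reordered and every value re-encoded.
table_a = [
    [0, 0, 1, 1, 2, 2, 3, 3],   # column 0
    [0, 0, 0, 0, 1, 1, 1, 1],   # column 1, determined by column 0
    [0, 1, 0, 1, 0, 1, 0, 1],   # column 2, independent of column 1
]
table_b = [
    ["x", "y", "x", "y", "x", "y", "x", "y"],  # corresponds to column 2
    ["p", "p", "q", "q", "r", "r", "s", "s"],  # corresponds to column 0
    ["u", "u", "u", "u", "v", "v", "v", "v"],  # corresponds to column 1
]

mapping = match_graphs(dependency_graph(table_a), dependency_graph(table_b))
print(mapping)  # (1, 2, 0)
```

No column name or cell value is compared across the two tables; only the dependency structure is used, which is what makes this style of matching applicable to opaque data.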

[1] Ronald Fagin, et al. Data exchange: semantics and query answering, 2003, Theor. Comput. Sci.

[2] Salih O. Duffuaa, et al. A Linear Programming Approach for the Weighted Graph Matching Problem, 1993, IEEE Trans. Pattern Anal. Mach. Intell.

[3] Philip A. Bernstein, et al. Incremental schema matching, 2006, VLDB.

[4] Pedro M. Domingos, et al. iMAP: discovering complex semantic matches between database schemas, 2004, SIGMOD '04.

[5] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[6] Pedro M. Domingos, et al. Learning Source Description for Data Integration, 2000, WebDB.

[7] Andrew B. Whinston, et al. Model management, 1994.

[8] Chris Clifton, et al. Semantic Integration in Heterogeneous Databases Using Neural Networks, 1994, VLDB.

[9] Laura M. Haas, et al. Schema Mapping as Query Discovery, 2000, VLDB.

[10] Raghu Ramakrishnan, et al. Conjunctive query equivalence of keyed relational schemas (extended abstract), 1997, PODS '97.

[11] Erhard Rahm, et al. Generic Schema Matching with Cupid, 2001, VLDB.

[12] Renée J. Miller, et al. Schema equivalence in heterogeneous systems: bridging theory and practice, 1994, Inf. Syst.

[13] Richard Hull, et al. Relative information capacity of simple relational database schemata, 1984, SIAM J. Comput.

[14] Carlo Batini, et al. Inclusion and Equivalence between Relational Database Schemata, 1982, Theor. Comput. Sci.

[15] Carmel Domshlak, et al. Rank Aggregation for Automatic Schema Matching, 2007, IEEE Transactions on Knowledge and Data Engineering.

[16] Panos M. Pardalos, et al. Quadratic Assignment Problem, 1997, Encyclopedia of Optimization.

[17] Hannu Toivonen, et al. Efficient discovery of functional and approximate dependencies using partitions, 1998, Proceedings 14th International Conference on Data Engineering.

[18] Tova Milo, et al. Using Schema Matching to Simplify Heterogeneous Data Translation, 1998, VLDB.

[19] Kurt M. Anstreicher, et al. A new bound for the quadratic assignment problem based on convex quadratic programming, 2001, Math. Program.

[20] Christoph Schnörr, et al. Evaluation of Convex Optimization Techniques for the Weighted Graph-Matching Problem in Computer Vision, 2001, DAGM-Symposium.

[21] Steven Gold, et al. A Graduated Assignment Algorithm for Graph Matching, 1996, IEEE Trans. Pattern Anal. Mach. Intell.

[22] Laura M. Haas, et al. Data-driven understanding and refinement of schema mappings, 2001, SIGMOD '01.

[23] Ronald Fagin, et al. Quasi-inverses of schema mappings, 2007, PODS '07.

[24] Phokion G. Kolaitis, et al. Composing schema mappings, 2005.

[25] Heikki Mannila, et al. Dependency Inference, 1987, VLDB.

[26] Philip A. Bernstein, et al. Implementing mapping composition, 2007, The VLDB Journal.

[27] Amihai Motro, et al. Database Schema Matching Using Machine Learning with Feature Selection, 2002, CAiSE.

[28] Erhard Rahm, et al. A survey of approaches to automatic schema matching, 2001, The VLDB Journal.

[29] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems, 1998, Proc. IEEE.

[30] Jiawei Han, et al. Discovering complex matchings across web query interfaces: a correlation mining approach, 2004, KDD.

[31] Silvana Castano, et al. Global Viewing of Heterogeneous Data Sources, 2001, IEEE Trans. Knowl. Data Eng.

[32] Shin Ishii, et al. Doubly constrained network for combinatorial optimization, 2002, Neurocomputing.

[33] Katta G. Murty, et al. Operations Research: Deterministic Optimization Models, 1994.

[34] Philip A. Bernstein, et al. Compiling mappings to bridge applications and databases, 2007, SIGMOD '07.

[35] Jeffrey F. Naughton, et al. On schema matching with opaque column names and data values, 2003, SIGMOD '03.

[36] Pedro M. Domingos, et al. Reconciling schemas of disparate data sources: a machine-learning approach, 2001, SIGMOD '01.

[37] Philip A. Bernstein, et al. A vision for management of complex models, 2000, SGMD.

[38] Nir Friedman, et al. Learning Bayesian Network Structure from Massive Datasets: The "Sparse Candidate" Algorithm, 1999, UAI.

[39] Laura M. Haas, et al. Clio: a semi-automatic tool for schema mapping, 2001, SIGMOD '01.

[40] David S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-Completeness, 1978.

[41] Laura M. Haas, et al. Clio grows up: from research prototype to industrial tool, 2005, SIGMOD '05.

[42] Heikki Mannila, et al. Approximate Inference of Functional Dependencies from Relations, 1995, Theor. Comput. Sci.

[43] Knud D. Andersen, et al. The Mosek Interior Point Optimizer for Linear Programming: An Implementation of the Homogeneous Algorithm, 2000.

[44] AnHai Doan, et al. iMAP: Discovering Complex Mappings between Database Schemas, 2004, SIGMOD 2004.

[45] Philip A. Bernstein, et al. Model management 2.0: manipulating richer mappings, 2007, SIGMOD '07.

[46] Erhard Rahm, et al. On Matching Schemas Automatically, 2001.

[47] Stephen J. Wright. Primal-Dual Interior-Point Methods, 1997, Other Titles in Applied Mathematics.

[48] Catriel Beeri, et al. Equivalence of relational database schemes, 1979, SIAM J. Comput.

[49] Petr Berka. PKDD 2001 Discovery Challenge on Thrombosis Data, 2001.

[50] Jorma Rissanen. On equivalences of database schemes, 1982, PODS '82.

[51] Chris Clifton, et al. SEMINT: A tool for identifying attribute correspondences in heterogeneous databases using neural networks, 2000, Data Knowl. Eng.

[52] Ronald Fagin, et al. Composing schema mappings: second-order dependencies to the rescue, 2004, PODS '04.

[53] Ben Taskar, et al. Selectivity estimation using probabilistic models, 2001, SIGMOD '01.

[54] Yannis E. Ioannidis, et al. Conjunctive Query Equivalence of Keyed Relational Schemas, 1997, PODS 1997.

[55] Shinji Umeyama, et al. An Eigendecomposition Approach to Weighted Graph Matching Problems, 1988, IEEE Trans. Pattern Anal. Mach. Intell.

[56] Kevin Chen-Chuan Chang, et al. Making holistic schema matching robust: an ensemble approach, 2005, KDD '05.

[57] Jérôme Euzenat, et al. A Survey of Schema-Based Matching Approaches, 2005, J. Data Semant.

[58] Peter Norvig, et al. Artificial Intelligence: A Modern Approach, 1995.

[59] Renée J. Miller, et al. The Use of Information Capacity in Schema Integration and Translation, 1993, VLDB.

[60] Erhard Rahm, et al. Similarity flooding: a versatile graph matching algorithm and its application to schema matching, 2002, Proceedings 18th International Conference on Data Engineering.

[61] Philip A. Bernstein, et al. ModelGen: model independent schema translation, 2005, 21st International Conference on Data Engineering (ICDE'05).