A survey of approaches to automatic schema matching

Abstract. Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. At the same time, previous research has proposed many techniques for partially automating the match operation in specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification, we review some previous match implementations, thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.
