Clustering Schema Elements for Semantic Integration of Heterogeneous Data Sources

Interschema relationship identification (IRI), that is, determining the relationships among schema elements in heterogeneous data sources, is an important step in integrating those sources. This article proposes a cluster-analysis-based approach to semi-automating the IRI process, which is typically time-consuming and requires extensive human interaction. The authors apply multiple clustering techniques, including K-means, hierarchical clustering, and a self-organizing map (SOM) neural network, to identify similar schema elements across heterogeneous data sources, based on a combination of features such as naming similarity, document similarity, schema specification, data patterns, and usage patterns. An SOM prototype developed by the authors gives users a visualization tool for displaying clustering results and for incrementally evaluating candidate similar elements.
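
As a rough illustration of the general idea, and not the authors' implementation, the sketch below groups schema element names by naming similarity alone, using a simple single-linkage hierarchical clustering. The similarity measure (Python's standard `difflib.SequenceMatcher`), the function names, the example element names, and the 0.7 threshold are all assumptions made for this example; the article itself combines several additional features and also uses K-means and SOM.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Similarity in [0, 1] between two element names (assumed measure)."""
    def normalize(s):
        # Ignore case and underscores so "cust_name" and "CustName" compare equal.
        return s.lower().replace("_", "")
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_elements(names, threshold=0.7):
    """Single-linkage agglomerative clustering over element names.

    Repeatedly merges the two most similar clusters until the best
    remaining similarity falls below `threshold` (a hypothetical cutoff).
    """
    clusters = [[n] for n in names]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            # Single linkage: cluster similarity is that of the closest pair.
            sim = max(name_similarity(a, b)
                      for a in clusters[i] for b in clusters[j])
            if sim > best:
                best, pair = sim, (i, j)
        if best < threshold:
            break
        i, j = pair  # i < j, so popping j leaves index i valid
        clusters[i].extend(clusters.pop(j))
    return clusters


if __name__ == "__main__":
    # Hypothetical element names drawn from two imagined schemas.
    elements = ["cust_name", "CustomerName", "order_date", "OrderDate", "phone"]
    for group in cluster_elements(elements):
        print(group)
```

Each resulting group is only a set of candidate similar elements; in a semi-automated process of the kind the abstract describes, a human analyst would still review the candidates before accepting any correspondence.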