A Distance-Based Approach to Entity Reconciliation in Heterogeneous Databases

In modern organizations, decision makers often need quick access to information from diverse sources in order to make timely decisions. A critical problem facing many such organizations is the inability to easily reconcile the information contained in heterogeneous data sources. To overcome this limitation, an organization must resolve several types of heterogeneity that may exist across sources. We examine one such problem, the entity heterogeneity problem, which arises when the same real-world entity type is represented using different identifiers in different applications. We propose a decision-theoretic model to resolve the problem. The model uses a distance measure to express the similarity between two entity instances. We have implemented the model and tested it on real-world data. The results indicate that the model performs well at predicting whether two entity instances should be matched. The model is also shown to be computationally efficient, and its prediction accuracy scales well to large relations. Overall, the test results suggest that the approach is viable in practical situations.
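
To make the idea of a distance between two entity instances concrete, the following is a minimal illustrative sketch, not the model proposed in the paper. It assumes a weighted average of per-attribute distances and a fixed match threshold. The attribute names, weights, scales, threshold, and the helper functions `string_distance`, `numeric_distance`, `record_distance`, and `should_match` are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def string_distance(a: str, b: str) -> float:
    """Distance in [0, 1]; 0 means identical strings (case-insensitive)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_distance(a: float, b: float, scale: float) -> float:
    """Absolute difference divided by an assumed scale, capped at 1."""
    return min(abs(a - b) / scale, 1.0)


def record_distance(r1: dict, r2: dict, weights: dict, scales: dict) -> float:
    """Weighted average of per-attribute distances (assumed form, in [0, 1])."""
    total_weight = sum(weights.values())
    weighted_sum = 0.0
    for attr, weight in weights.items():
        x, y = r1[attr], r2[attr]
        if isinstance(x, str):
            weighted_sum += weight * string_distance(x, y)
        else:
            weighted_sum += weight * numeric_distance(x, y, scales[attr])
    return weighted_sum / total_weight


def should_match(r1: dict, r2: dict, weights: dict, scales: dict,
                 threshold: float = 0.25) -> bool:
    """Declare a match when the distance is at or below a hypothetical threshold."""
    return record_distance(r1, r2, weights, scales) <= threshold


# Hypothetical customer records from two different applications.
weights = {"name": 2.0, "age": 1.0}
scales = {"age": 50.0}
source_a = {"name": "John Smith", "age": 30}
source_b = {"name": "john smith", "age": 30}
print(should_match(source_a, source_b, weights, scales))
```

In a decision-theoretic setting the threshold would normally be derived from the relative costs of false matches and missed matches instead of being fixed by hand as it is here.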
