Entity matching in heterogeneous databases: a distance-based decision model

The need to leverage the information contained in heterogeneous data sources has been widely documented. To do so, an organization must resolve several types of heterogeneity problems that may exist across different data sources. We investigate one such problem, the entity heterogeneity problem, which arises when the same real-world entity type is represented using different identifiers in different applications. We propose a decision-theoretic model to resolve the problem. Our model uses a distance-based measure to express the similarity between two entity instances. We have implemented the model, and our experimental results indicate that the approach is viable in real-world situations.
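
As a rough illustration of the idea, and not the model proposed in this paper, the following Python sketch computes a weighted distance between two entity instances and then chooses the cheaper of two decisions (match or non-match). Every name in it is hypothetical: the string distance (based on `difflib.SequenceMatcher`), the attribute weights, the error costs, and the shortcut of treating one minus the distance as a match probability are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def attribute_distance(a: str, b: str) -> float:
    """Distance in [0, 1] between two attribute values (0 means identical).

    Assumption: a generic string-similarity ratio stands in for whatever
    attribute-level distance a real matching model would use.
    """
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_distance(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted average of attribute distances over the keys in `weights`."""
    total_weight = sum(weights.values())
    weighted_sum = sum(
        w * attribute_distance(r1[key], r2[key]) for key, w in weights.items()
    )
    return weighted_sum / total_weight


def decide_match(distance: float, cost_false_match: float,
                 cost_false_non_match: float) -> bool:
    """Return True if declaring a match has the lower expected cost.

    Assumption: (1 - distance) is used as a crude stand-in for the
    probability that the two instances refer to the same entity.
    """
    p_match = 1.0 - distance
    expected_cost_of_match = (1.0 - p_match) * cost_false_match
    expected_cost_of_non_match = p_match * cost_false_non_match
    return expected_cost_of_match <= expected_cost_of_non_match


if __name__ == "__main__":
    # Hypothetical records from two applications that use different identifiers.
    customer = {"name": "John A. Smith", "city": "Boston"}
    client = {"name": "Smith, John", "city": "Boston"}
    weights = {"name": 2.0, "city": 1.0}

    d = record_distance(customer, client, weights)
    print("distance:", round(d, 3))
    print("match:", decide_match(d, cost_false_match=1.0, cost_false_non_match=1.0))
```

Raising `cost_false_match` relative to `cost_false_non_match` makes the sketch more conservative: a pair must then be closer before it is declared a match.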