A Probabilistic Decision Model for Entity Matching in Heterogeneous Databases

Database systems have proliferated in organizations of all types in recent years. In many cases, these databases are developed in different departments and maintained autonomously. Much is to be gained, however, if databases across departments, divisions, or even organizations can be related to one another. A main obstacle to relating data stored in different databases is that they represent real-world entities differently, for example by using different identifiers or primary keys. We present a decision-theoretic model for matching entities across different databases. The decision to match two entities from two different databases inherently involves uncertainty, since an exact match may not be found because of errors in data collection, data entry, and data representation. We model this uncertainty using probability theory and propose an integer programming formulation that minimizes the total cost associated with the entity matching decision. The model has been implemented and validated on real-world data.
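The abstract does not spell out the formulation, so the following is only a minimal sketch of what a cost-minimizing matching decision could look like, not the paper's model. It assumes a hypothetical matrix `p` where `p[i][j]` is the probability that record `i` of one database and record `j` of another denote the same entity, two hypothetical unit costs (`c_false_match`, `c_false_nonmatch`), and a constraint that each record is matched at most once. An exhaustive search stands in for an integer programming solver, which is only practical for tiny inputs.

```python
from itertools import product


def expected_cost(p, match, c_false_match, c_false_nonmatch):
    """Expected cost of a matching decision (assumed cost structure).

    match[i] is the column matched to row i, or None if row i is unmatched.
    """
    total = 0.0
    for i, row in enumerate(p):
        for j, p_ij in enumerate(row):
            if match[i] == j:
                # Declared a match: we pay if the pair is in fact different.
                total += (1.0 - p_ij) * c_false_match
            else:
                # Declared a non-match: we pay if the pair is in fact the same.
                total += p_ij * c_false_nonmatch
    return total


def best_matching(p, c_false_match=1.0, c_false_nonmatch=1.0):
    """Enumerate all one-to-one partial matchings and keep the cheapest."""
    n_cols = len(p[0])
    options = [None] + list(range(n_cols))
    best, best_cost = None, float("inf")
    for match in product(options, repeat=len(p)):
        used = [j for j in match if j is not None]
        if len(used) != len(set(used)):
            continue  # a column was matched to more than one row
        cost = expected_cost(p, match, c_false_match, c_false_nonmatch)
        if cost < best_cost:
            best, best_cost = match, cost
    return best, best_cost


# Illustrative probabilities only; not data from the paper.
p = [[0.9, 0.2],
     [0.3, 0.1]]
match, cost = best_matching(p)
print(match, round(cost, 3))
```

Raising `c_false_match` relative to `c_false_nonmatch` makes the search more reluctant to declare matches, which is the kind of trade-off a cost-based decision model is meant to expose.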
