Entity identification for heterogeneous database integration: a multiple classifier system approach and empirical evaluation

Entity identification, i.e., detecting semantically corresponding records across heterogeneous data sources, is a critical step in integrating those sources. The objective of this research is to develop and evaluate a novel multiple classifier system approach that improves entity identification accuracy. We apply classification techniques drawn from statistical pattern recognition, machine learning, and artificial neural networks to determine whether two records from different data sources represent the same real-world entity. We then combine the resulting classifiers in several ways to improve classification accuracy. We report promising empirical results showing that combining multiple classifiers improves performance.
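To make the idea concrete, the following is a minimal sketch of a record-pair classifier ensemble, not the system evaluated in the paper. It assumes records are dictionaries of strings, uses `difflib.SequenceMatcher` as a stand-in similarity measure, and replaces trained classifiers with three hypothetical fixed-threshold rules (`mean_rule`, `min_rule`, `first_field_rule`) combined by simple majority vote. All field names, thresholds, and function names are illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a string similarity in [0, 1] after trimming and lowercasing."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def pair_features(rec_a, rec_b, fields):
    """Build one similarity score per shared field for a pair of records."""
    return [similarity(rec_a.get(f, ""), rec_b.get(f, "")) for f in fields]


# Hypothetical base classifiers. Each maps a feature vector to True (match)
# or False (non-match). In practice these would be trained models; fixed
# thresholds are used here only to keep the sketch self-contained.
def mean_rule(x, threshold=0.8):
    return sum(x) / len(x) >= threshold


def min_rule(x, threshold=0.6):
    return min(x) >= threshold


def first_field_rule(x, threshold=0.9):
    return x[0] >= threshold


def majority_vote(classifiers, x):
    """Combine base classifiers: match if more than half of them vote match."""
    votes = sum(1 for clf in classifiers if clf(x))
    return votes * 2 > len(classifiers)


def same_entity(rec_a, rec_b, fields=("name", "city")):
    """Decide whether two records refer to the same real-world entity."""
    x = pair_features(rec_a, rec_b, fields)
    return majority_vote([mean_rule, min_rule, first_field_rule], x)


if __name__ == "__main__":
    a = {"name": "John Smith", "city": "Boston"}
    b = {"name": "john smith ", "city": "BOSTON"}
    print(same_entity(a, b))
```

Majority voting is only one of many possible combination schemes; it is used here because it needs no training data. Swapping in trained base classifiers would leave `majority_vote` and `same_entity` unchanged as long as each classifier still takes a feature vector and returns a boolean.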
