Entity Matching: A Case Study in the Medical Domain

In this paper, we propose a simple and effective solution to the entity matching problem for data records of healthcare professionals. Our method relies on three attributes that are available in most data sources in the medical domain: name, specialty, and address. We apply a blocking technique to avoid unnecessary comparisons, three matchers to reconcile the data records, and a rule-based heuristic to combine the matchers. We ran experiments on data from three Brazilian Web sources of healthcare professionals. The results show that our solution avoids unnecessary comparisons and yields good matching results.
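The pipeline described above (blocking, three attribute matchers, and a rule that combines them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method evaluated in the paper: the blocking key, the similarity functions (difflib ratio for names, exact comparison for specialties, token Jaccard for addresses), the thresholds, and all function names are hypothetical choices made for this sketch.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
import unicodedata


def normalize(text):
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().split())


def blocking_key(record):
    # Assumed key: first letter of the first name token plus the last name token.
    tokens = normalize(record["name"]).split()
    return (tokens[0][0], tokens[-1]) if tokens else ("", "")


def name_matcher(a, b):
    # Character-level similarity in [0, 1] between normalized names.
    return SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()


def specialty_matcher(a, b):
    # 1.0 when the normalized specialties are identical, otherwise 0.0.
    return 1.0 if normalize(a["specialty"]) == normalize(b["specialty"]) else 0.0


def address_matcher(a, b):
    # Jaccard similarity over the sets of address tokens.
    ta = set(normalize(a["address"]).split())
    tb = set(normalize(b["address"]).split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def is_match(a, b, name_min=0.85, address_min=0.5):
    # Assumed rule: names must agree, plus either the specialty or the address.
    if name_matcher(a, b) < name_min:
        return False
    return specialty_matcher(a, b) == 1.0 or address_matcher(a, b) >= address_min


def match_records(records):
    """Return (matched index pairs, number of pairwise comparisons performed)."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)

    matches, compared = [], 0
    for ids in blocks.values():
        # Only records that share a blocking key are ever compared.
        for i, j in combinations(ids, 2):
            compared += 1
            if is_match(records[i], records[j]):
                matches.append((i, j))
    return matches, compared
```

Calling `match_records` on a list of dictionaries with `name`, `specialty`, and `address` keys returns the matched index pairs along with the number of comparisons made, which can be set against the n(n-1)/2 comparisons a blocking-free approach would need.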
