Detecting Multi-Relationship Links in Sparse Datasets

In application areas such as healthcare and insurance, many patients or clients have their lifetime records spread across the databases of different providers. Record linkage is the task of using algorithms to identify records in different datasets that refer to the same individual. Where unique identifiers are shared across datasets, linking those records is trivial. However, a very large number of individuals cannot be matched in this way, because common identifiers do not exist across datasets and their identifying information is inexact or, often, quite different (e.g. after a change of address). In this research, we present a new approach to record linkage that also detects relationships between customers (e.g. family members). We present a validation that highlights the best parameter and configuration settings for the types of relationship links required.
