Parallel computing techniques for high-performance probabilistic record linkage

Record linkage techniques are used to link together records from one or more data sets that relate to the same entity, e.g. a patient or customer. As data is often not primarily collected for data analysis purposes, a common unique identifier is missing in many cases, and probabilistic linkage techniques have to be applied. Historical collections of administrative and other (health) data nowadays contain tens of millions of records, with new data being added at the rate of millions of records per year. Although improvements in available computing power have to some extent mitigated the effects of this accelerating growth in the size of the data sets to be linked, large-scale probabilistic record linkage is still a slow and resource-intensive process. The ANU Data Mining Group is currently working in collaboration with the Epidemiology and Surveillance Branch of the NSW Health Department on the development of improved techniques for probabilistic record linkage. Our main focus is the development of techniques that make good use of modern high-performance parallel computers, and the exploration of data mining and machine learning techniques to reduce the time-consuming and tedious manual clerical review process for possible links. The developed software will be published under an open source software license. We hope to have prototype software available early in the second half of 2002.
