A transparent and transportable methodology for evaluating Data Linkage software

There has been substantial growth in Data Linkage (DL) activities in recent years. This reflects growth in both the demand for, and the supply of, linked or linkable data. Increased utilisation of DL "services" has brought with it an increased need for impartial information about the suitability and performance capabilities of DL software programs and packages. Although evaluations of DL software exist, most have been restricted to the comparison of two or three packages. Evaluations of a large number of packages are rare because of the time and resource burden placed on the evaluators and the need for a suitable "gold standard" evaluation dataset. In this paper we present an evaluation methodology that overcomes a number of these difficulties. Our approach involves the generation and use of representative synthetic data, the execution of a series of linkages using a pre-defined linkage strategy, and the use of standard linkage quality metrics to assess performance. The methodology is both transparent and transportable, producing genuinely comparable results. The methodology was used by the Centre for Data Linkage (CDL) at Curtin University in an evaluation of ten DL software packages. It is also being used to evaluate larger linkage systems (not just packages). The methodology provides a unique opportunity to benchmark the quality of linkages in different operational environments.
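The final step, scoring a linkage against a dataset with known truth, can be illustrated with a minimal sketch. The abstract does not name the metrics used, so the choice of pairwise precision, recall and F-measure here is an assumption, as are the function name `linkage_quality` and the toy record identifiers. With synthetic data the set of true matching pairs is known by construction, which is what makes this comparison possible.

```python
from typing import Dict, Set, Tuple

# A candidate link is an (id_in_file_A, id_in_file_B) pair.
Pair = Tuple[str, str]


def linkage_quality(true_matches: Set[Pair], predicted: Set[Pair]) -> Dict[str, float]:
    """Pairwise precision, recall and F-measure (hypothetical helper)."""
    tp = len(true_matches & predicted)   # links that are correct
    fp = len(predicted - true_matches)   # links that should not exist
    fn = len(true_matches - predicted)   # true matches that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Toy example: four true matches; the software under test returns three
# links, one of which is wrong.
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
links = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
scores = linkage_quality(truth, links)
```

Because each package is scored with the same function against the same truth set, the resulting figures are directly comparable across packages, which is the property the methodology relies on.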
