Demystifying probabilistic linkage: Common myths and misconceptions

Abstract: Many of the distinctions made between probabilistic and deterministic linkage are misleading. While these two approaches to record linkage operate in different ways and can produce different outputs, the distinctions between them result more from how they are implemented than from any intrinsic differences. As generally applied, probabilistic and deterministic procedures can be little more than alternative means to similar ends, or they can arrive at very different ends, depending on choices made during implementation. Misconceptions about probabilistic linkage contribute to reluctance to implement it and to mistrust of its outputs. We aim to explain how the outputs of either approach can be tailored to suit the intended application, and also to highlight the ways in which probabilistic linkage is generally more flexible, more powerful and more informed by the data. We do this by examining common misconceptions about probabilistic linkage and how it differs from deterministic linkage, highlighting the potential impact of design choices on the outputs of either approach. We hope that a better understanding of linkage designs will help to allay concerns about probabilistic linkage, and help data linkers to select and tailor procedures that produce outputs appropriate for their intended use.
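To make the role of implementation choices concrete, the following is a minimal sketch, not the authors' procedure. It contrasts an exact-match deterministic rule with a probabilistic rule in the style of the Fellegi-Sunter model, in which each field contributes a log-likelihood-ratio weight and a pair is linked when the summed weight reaches a threshold. The field names, the m- and u-probabilities, the thresholds and the example records are all hypothetical values chosen for illustration.

```python
from math import log2

# Hypothetical m- and u-probabilities (illustrative values, not from the paper):
#   m = P(field agrees | the two records belong to the same person)
#   u = P(field agrees | the two records belong to different people)
PARAMS = {
    "surname": (0.95, 0.01),
    "dob": (0.98, 0.005),
    "postcode": (0.80, 0.05),
}


def deterministic_link(a, b, required=("surname", "dob", "postcode")):
    """Link only if every required field agrees exactly."""
    return all(a[field] == b[field] for field in required)


def match_weight(a, b, params=PARAMS):
    """Sum per-field log2 likelihood ratios (agreement or disagreement weight)."""
    total = 0.0
    for field, (m, u) in params.items():
        if a[field] == b[field]:
            total += log2(m / u)              # agreement weight (positive here)
        else:
            total += log2((1 - m) / (1 - u))  # disagreement weight (negative here)
    return total


def probabilistic_link(a, b, threshold):
    """Link if the summed match weight reaches the chosen threshold."""
    return match_weight(a, b) >= threshold


# Two hypothetical records that agree on surname and date of birth
# but disagree on postcode (for example, after a house move).
rec_a = {"surname": "smith", "dob": "1980-02-01", "postcode": "AB1 2CD"}
rec_b = {"surname": "smith", "dob": "1980-02-01", "postcode": "XY9 8ZW"}

print(deterministic_link(rec_a, rec_b))      # exact-match rule rejects the pair
print(round(match_weight(rec_a, rec_b), 2))  # summed weight for the pair
print(probabilistic_link(rec_a, rec_b, 10))  # lenient threshold accepts the pair
print(probabilistic_link(rec_a, rec_b, 15))  # strict threshold rejects the pair
```

Under these assumed parameters, the same pair is linked or not depending only on the threshold. A threshold set high enough that only full agreement can reach it behaves like the exact-match rule, which is one way to see that the two approaches can be tuned toward similar or very different outputs.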
