Assessing data linkage quality in cohort studies

Abstract

Background: Linkage of administrative data sources provides an efficient means of collecting detailed data on how individuals interact with cross-sectoral services, society, and the environment. These data can be used to supplement conventional cohort studies, or to create population-level electronic cohorts generated solely from administrative data. However, errors occurring during linkage (false matches and missed matches) can bias results from linked data.

Aim: This paper provides guidance on evaluating linkage quality in cohort studies.

Methods: We provide an overview of methods for linkage, describe mechanisms by which linkage error can introduce bias, and draw on real-world examples to demonstrate methods for evaluating linkage quality.

Results: The methods for evaluating linkage quality described in this paper provide guidance on (i) estimating linkage error rates, (ii) understanding the mechanisms by which linkage error might bias results, and (iii) the information that should be shared between data providers, linkers and users, so that approaches to handling linkage error in analysis can be implemented.

Conclusion: Linked administrative data can enhance conventional cohorts and offer the ability to answer questions that require large sample sizes or hard-to-reach populations. Care needs to be taken to evaluate linkage quality so that results are robust.
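As an illustration of point (i), the sketch below estimates the two error types named in the abstract, false matches and missed matches, by comparing a set of links against a gold-standard sample of known true matches. It is a minimal sketch and not the paper's own method: the function name `linkage_error_rates`, the record identifiers, and the choice of denominators (false matches as a share of links made, missed matches as a share of true matches) are all assumptions made for the example.

```python
def linkage_error_rates(linked_pairs, true_pairs):
    """Estimate linkage error rates against a gold-standard sample.

    Both arguments are iterables of (record_id_a, record_id_b) tuples.
    Denominators are an assumption of this sketch:
      - false-match rate  = false matches / all links made
      - missed-match rate = missed matches / all true matches
    """
    linked = set(linked_pairs)
    truth = set(true_pairs)

    false_matches = linked - truth    # links made that are not true matches
    missed_matches = truth - linked   # true matches the linkage failed to find

    return {
        "false_match_rate": len(false_matches) / len(linked) if linked else 0.0,
        "missed_match_rate": len(missed_matches) / len(truth) if truth else 0.0,
    }


# Hypothetical gold-standard sample: four true matches between files A and B.
true_pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]

# Hypothetical linkage output: two correct links, one false match, one record unlinked.
linked_pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]

rates = linkage_error_rates(linked_pairs, true_pairs)
print(rates)
```

In this toy example one of the three links made is wrong, and two of the four true matches are not recovered (the pair involving `a3` is counted as both a false match and a missed match, because the record was linked to the wrong partner). In practice a gold standard is usually available only for a subsample, so rates estimated this way carry sampling uncertainty.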
