Building a National HIV Cohort from Routine Laboratory Data: Probabilistic Record-Linkage with Graphs

Background. Chronic disease management requires the ability to link patient records across multiple interactions with the health sector. South Africa’s National Health Laboratory Service (NHLS) conducts all routine laboratory monitoring for the country’s national public sector HIV program. However, the absence of a validated patient identifier has limited the potential of the NHLS database for epidemiological research, policy evaluation, and longitudinal patient care. We developed and validated a record linkage algorithm, creating a unique patient identifier and enabling analysis of the NHLS database as a national HIV cohort. To our knowledge, this is the first national HIV cohort in any low- or middle-income country.

Methods. We linked data on all CD4 counts, HIV viral loads (VL), and ART workup laboratory tests from 2004-2016. Each NHLS laboratory test result is associated with a name, sex, date of birth (DOB), gender, and facility. However, because of typographical and other errors and patient mobility between facilities, specimens from the same patient may be associated with different sets of identifying information. We developed a graph-based probabilistic record linkage algorithm and used it to construct a unique identifier for all patients with laboratory results in the national HIV program. We used standard probabilistic linkage methods with Jaro-Winkler string comparisons and weights informed by response frequency. We also used graph concepts to guide the linkage, determining whether a cluster of patient specimens could plausibly reflect a single patient. This approach allows matching thresholds to vary with the density of the network and limits over-matching. To train and validate our approach, we constructed a quasi-gold standard based on manual review of 59,000 candidate matches associated with 1000 randomly sampled specimens. These data were divided into training and validation sets.
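The linkage ingredients named in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `m_prob` value, and the use of simple edge density as the graph measure are all assumptions made for illustration. It shows a Jaro-Winkler string comparison, a frequency-informed agreement weight in the Fellegi-Sunter style (rarer values earn more weight when they agree), and the edge density of a candidate cluster of records.

```python
import math


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1.0 - j)


def agreement_weight(value_frequency: float, m_prob: float = 0.95) -> float:
    """Log2 agreement weight; m_prob is an assumed, illustrative value.

    value_frequency approximates the chance that two unrelated records
    share this value, so rare values yield larger weights.
    """
    return math.log2(m_prob / value_frequency)


def cluster_density(n_records: int, edges) -> float:
    """Share of possible record pairs in a cluster that are linked."""
    if n_records < 2:
        return 1.0  # a single record is trivially coherent (assumption)
    unique = {frozenset(e) for e in edges if e[0] != e[1]}
    return len(unique) / (n_records * (n_records - 1) / 2)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))   # 0.961
print(cluster_density(4, [(0, 1), (1, 2), (2, 3)]))  # 0.5
```

A sparse cluster, such as the four-record chain above with density 0.5, is the kind of structure where a stricter matching threshold might be applied to limit over-matching; how the thresholds actually vary with density is not specified here.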
Domain weights and graph parameters were optimized using the manually matched training data. To evaluate performance, we calculated the probability that a true match was correctly identified by our algorithm (sensitivity, Sen) and the probability that a match identified by our algorithm was truly a match (positive predictive value, PPV) in the manually matched data. We also assessed validity in the full cohort using proxies for under- and over-matching and assessed sensitivity against national identification numbers and patient folder numbers, which were available for a subset of records. We compared the performance of our algorithm with that of exact matching and of a prior identifier that had been developed by the NHLS Corporate Data Warehouse.

Results. As of December 2016, the NHLS database contained 117 million patient specimens with a CD4, VL, or other laboratory test used in HIV care. These specimens had 63 million unique combinations of patient identifying information. From these data, our matching algorithm identified 11.6 million unique HIV patients who had at least one CD4 count or VL result. These patients had 70.9 million total specimens, with a median of 3 specimens per patient (IQR 1 to 8). Sensitivity and PPV of the algorithm were estimated to be 93.7% and 98.6% in manually matched data, compared to 64.1% and 100.0% for the existing NHLS identifier. We estimated that in 2016 there were 3.35 million patients on ART and virologically monitored, similar to the National Department of Health estimate of 3.50 million.

Conclusion. We constructed a South African National HIV Cohort by applying novel graph-based probabilistic record linkage techniques to routinely collected laboratory data, with high sensitivity and positive predictive value. Information on graph structure can guide record linkage in large populations when identifying data are limited.
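The two performance measures defined in the Methods, sensitivity and PPV, can be computed from sets of record pairs. The sketch below assumes a pairwise formulation (each match is an unordered pair of specimen IDs); the function name and the toy pairs are hypothetical and are not drawn from the study data.

```python
def pairwise_sen_ppv(true_pairs, predicted_pairs):
    """Return (sensitivity, PPV) for predicted links against reviewed links.

    Each pair is an unordered pair of record IDs, so (1, 2) and (2, 1)
    are treated as the same link.
    """
    true_set = {frozenset(p) for p in true_pairs}
    pred_set = {frozenset(p) for p in predicted_pairs}
    true_positives = len(true_set & pred_set)
    # Sensitivity: share of true matches that the algorithm found.
    sen = true_positives / len(true_set) if true_set else float("nan")
    # PPV: share of algorithm matches that are true matches.
    ppv = true_positives / len(pred_set) if pred_set else float("nan")
    return sen, ppv


# Toy example only: 4 reviewed true links, 4 predicted links, 3 in common.
reviewed = [(1, 2), (1, 3), (2, 3), (4, 5)]
predicted = [(2, 1), (2, 3), (4, 5), (5, 6)]
print(pairwise_sen_ppv(reviewed, predicted))  # (0.75, 0.75)
```

An exact-matching identifier never links records that differ, which is consistent with the pattern reported above for the existing NHLS identifier: very high PPV but lower sensitivity.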
