Identifying Cases of Type 2 Diabetes in Heterogeneous Data Sources: Strategy from the EMIF Project

Due to the heterogeneity of existing European sources of observational healthcare data, data source-tailored choices are needed to execute multi-data source, multi-national epidemiological studies. This makes transparent documentation paramount. In this proof-of-concept study, a novel standard data derivation procedure was tested in a set of heterogeneous data sources. Identification of subjects with type 2 diabetes mellitus (T2DM) was the test case. We included three primary care data sources (PCDs), three record linkage data sources based on administrative and/or registry data (RLDs), one hospital data source and one biobank. Overall, data from 12 million subjects from six European countries were extracted. Based on a shared event definition, sixteen standard algorithms (components) useful to identify T2DM cases were generated through a top-down/bottom-up iterative approach. Each component was based on a single data domain among diagnoses, drugs, diagnostic test utilization and laboratory results. Diagnoses-based components were subclassified by healthcare setting (primary, secondary, inpatient care). The Unified Medical Language System was used for semantic harmonization within data domains. Individual components were extracted and the proportion of the population identified by each was compared across data sources. Drug-based components performed similarly in RLDs and PCDs, unlike diagnoses-based components. Using components as building blocks, logical combinations with AND, OR, AND NOT were tested, and local experts recommended their preferred data source-tailored combination. The proportion of the population identified by the resulting algorithms varied across data sources from 3.5% to 15.7%; however, age-specific results were fairly comparable. The impact of individual components was assessed: diagnoses-based components identified the majority of cases in PCDs (93–100%), while drug-based components were the main contributors in RLDs (81–100%).
The proposed data derivation procedure allowed the generation of data source-tailored case-finding algorithms in a standardized fashion, facilitated transparent documentation of the process and benchmarking of data sources, and provided a basis for interpreting possible inter-data source inconsistencies in the findings of future studies.
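
The idea of using components as building blocks can be illustrated with a minimal sketch, which is not the project's actual implementation. It assumes that each component returns the set of subject identifiers it flags, so that OR, AND and AND NOT map onto set union, intersection and difference. The component names, the subject identifiers, the population size and the particular combination (including the exclusion of a type 1 component) are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set

# Hypothetical component outputs: component name -> subject IDs it flags.
# Names and IDs are invented for illustration, not taken from the study.
components: Dict[str, Set[int]] = {
    "diag_primary_care": {1, 2, 3, 4},
    "diag_inpatient": {3, 5},
    "drug_antidiabetic": {2, 3, 5, 6},
    "diag_type1": {6},
}


def composite_algorithm(c: Dict[str, Set[int]]) -> Set[int]:
    """One possible data source-tailored combination (an assumption)."""
    any_diagnosis = c["diag_primary_care"] | c["diag_inpatient"]  # OR
    candidates = any_diagnosis | c["drug_antidiabetic"]           # OR
    return candidates - c["diag_type1"]                           # AND NOT


def strict_algorithm(c: Dict[str, Set[int]]) -> Set[int]:
    """A stricter variant: a diagnosis AND a drug record are both required."""
    any_diagnosis = c["diag_primary_care"] | c["diag_inpatient"]  # OR
    return any_diagnosis & c["drug_antidiabetic"]                 # AND


def component_contribution(
    c: Dict[str, Set[int]], cases: Set[int]
) -> Dict[str, float]:
    """Share of the identified cases that each single component also flags."""
    return {name: len(ids & cases) / len(cases) for name, ids in c.items()}


population_size = 10  # hypothetical size of the source population
cases = composite_algorithm(components)
proportion_identified = len(cases) / population_size
contribution = component_contribution(components, cases)
```

With these toy inputs, `cases` contains the subjects flagged by any diagnosis or drug component except the one also flagged by the hypothetical type 1 component, and `contribution` shows how much of that case set each component would recover on its own, which mirrors the kind of per-component impact assessment described above.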
