Identifying clinically important COPD sub-types using data-driven approaches in primary care population based electronic health records

Background: COPD is a highly heterogeneous disease composed of different phenotypes with different aetiological and prognostic profiles, and current classification systems do not fully capture this heterogeneity. In this study we sought to discover, describe and validate COPD subtypes using cluster analysis on data derived from electronic health records.

Methods: We applied two unsupervised learning algorithms (k-means and hierarchical clustering) to 30,961 current and former smokers diagnosed with COPD, using linked national structured electronic health records in England available through the CALIBER resource. We used 15 clinical features, including risk factors and comorbidities, and performed dimensionality reduction using multiple correspondence analysis. We examined the association of cluster membership with COPD exacerbations and with respiratory and cardiovascular death; 10,736 deaths were recorded over 146,466 person-years of follow-up. We also implemented and tested a process to assign unseen patients to clusters using a decision tree classifier.

Results: We identified and characterized five COPD patient clusters that differed in demographics, comorbidities, risk of death and exacerbations. Four of the clusters were associated with 1) anxiety/depression; 2) severe airflow obstruction and frailty; 3) cardiovascular disease and diabetes; and 4) obesity/atopy. The fifth cluster was associated with low prevalence of most comorbid conditions.

Conclusions: COPD patients can be sub-classified into groups with differing risk factors, comorbidities, and prognosis, based on data included in their primary care records. The identified clusters confirm findings of previous clustering studies and draw attention to anxiety and depression as important drivers of the disease in young, female patients.
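The analysis pipeline named in the Methods (multiple correspondence analysis, k-means and hierarchical clustering, then a decision tree to assign unseen patients) can be illustrated with a minimal sketch. This is not the authors' code: the data are synthetic, and the number of retained dimensions, the Ward linkage, the silhouette check, the tree depth and the choice to fit the tree on the raw binary features are all assumptions made for illustration. Only the 15 binary features and the five clusters echo the abstract.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from sklearn.tree import DecisionTreeClassifier


def mca_row_coordinates(X_binary, n_components):
    """Multiple correspondence analysis of 0/1 features via SVD."""
    # Indicator matrix: one column per category ("present", "absent").
    Z = np.hstack([X_binary, 1 - X_binary]).astype(float)
    Z = Z[:, Z.sum(axis=0) > 0]          # drop categories never observed
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardised residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, _ = np.linalg.svd(S, full_matrices=False)
    # Principal row coordinates on the leading dimensions.
    return (U[:, :n_components] * sigma[:n_components]) / np.sqrt(r)[:, None]


# Synthetic stand-in for the cohort: 600 patients, 15 binary clinical features,
# generated from 5 hypothetical latent groups with different prevalences.
rng = np.random.default_rng(0)
prevalence = rng.uniform(0.05, 0.6, size=(5, 15))
group = rng.integers(0, 5, size=600)
X = (rng.random((600, 15)) < prevalence[group]).astype(int)
X_train, X_new = X[:500], X[500:]        # X_new plays the role of unseen patients

# Dimensionality reduction, then the two clustering algorithms.
coords = mca_row_coordinates(X_train, n_components=5)
km_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)
hc_labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(coords)

# Internal validity of the k-means solution (assumed metric, range -1 to 1).
sil = silhouette_score(coords, km_labels)

# Decision tree that maps raw features to cluster labels for unseen patients.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, km_labels)
assigned = tree.predict(X_new)

print(f"silhouette = {sil:.3f}")
print("assigned cluster counts:", np.bincount(assigned, minlength=5))
```

Fitting the tree on the original features rather than on the MCA coordinates keeps the assignment rule readable in clinical terms, which is one plausible reason to use a decision tree for this step.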
