Clustering datasets with demographics and diagnosis codes

Clustering data derived from Electronic Health Record (EHR) systems is important both for discovering relationships between the clinical profiles of patients and as a preprocessing step for analysis tasks such as classification. However, the heterogeneity of these data makes existing clustering methods difficult to apply and calls for new clustering approaches. In this paper, we propose the first approach for clustering a dataset in which each record contains a patient's demographic attribute values and their set of diagnosis codes. Our approach represents the dataset in a binary form in which the features are selected demographic values, as well as combinations (patterns) of frequent and correlated diagnosis codes. This representation enables measuring similarity between records using cosine similarity, an effective measure for binary-represented data, and finding compact, well-separated clusters through hierarchical clustering. Our experiments on two publicly available EHR datasets, comprising over 26,000 and 52,000 records, demonstrate that our approach constructs clusters with correlated demographics and diagnosis codes, and that it is efficient and scalable.
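The pipeline outlined in the abstract (binary representation, cosine similarity, hierarchical clustering) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the toy records, the hand-picked feature list, the helper names (`to_binary`, `cosine`, `agglomerate`), and the choice of average linkage are all assumptions made for illustration. In particular, the features are fixed by hand here rather than selected as frequent and correlated patterns.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy records: (demographic values, set of diagnosis codes).
records = [
    ({"age:60-70", "sex:M"}, {"401.9", "250.00"}),
    ({"age:60-70", "sex:M"}, {"401.9", "250.00", "272.4"}),
    ({"age:20-30", "sex:F"}, {"493.90"}),
    ({"age:20-30", "sex:F"}, {"493.90", "477.9"}),
]

# Hypothetical features: selected demographic values and diagnosis-code
# patterns. They are hand-picked here, not mined from the data.
features = [
    frozenset({"age:60-70"}),
    frozenset({"age:20-30"}),
    frozenset({"401.9", "250.00"}),
    frozenset({"493.90"}),
]

def to_binary(record):
    """Return a 0/1 vector: 1 if the record contains every item of a feature."""
    demographics, codes = record
    items = demographics | codes
    return [1 if feature <= items else 0 for feature in features]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def agglomerate(vectors, k):
    """Naive agglomerative clustering (average linkage, an assumption)
    that merges the most similar pair of clusters until k remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        def avg_sim(pair):
            a, b = pair
            sims = [cosine(vectors[i], vectors[j])
                    for i in clusters[a] for j in clusters[b]]
            return sum(sims) / len(sims)
        # combinations yields a < b, so deleting b leaves index a valid.
        a, b = max(combinations(range(len(clusters)), 2), key=avg_sim)
        clusters[a] += clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

vectors = [to_binary(r) for r in records]
clusters = agglomerate(vectors, 2)
print(clusters)
```

On this toy input, the two older male patients sharing the hypertension and diabetes pattern fall into one cluster and the two younger female patients with the asthma code fall into the other. The naive merge loop is only meant to show the idea and would not scale to datasets of the size reported in the abstract.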
