ICD2Vec: Mathematical representation of diseases

The International Classification of Diseases (ICD) codes are the global standard for reporting disease conditions. The current ICD codes are hierarchically structured and convey only partial relationships among diseases. It is therefore important to represent the ICD codes as mathematical vectors that can capture the complex relationships across diseases. Here, we propose a framework, denoted "ICD2Vec", that provides mathematical representations of diseases by encoding the corresponding information. First, we present the arithmetic and semantic relationships between diseases by mapping composite vectors for symptoms or diseases to the most similar ICD codes. Second, we confirm the validity of ICD2Vec by comparing the biological relationships and cosine similarities among the vectorized ICD codes. Third, we propose a new risk score derived from ICD2Vec and demonstrate its potential clinical utility for coronary artery disease, type 2 diabetes, dementia, and liver cancer, based on a large prospective cohort from the UK and a large set of electronic medical records from a medical center in South Korea. In summary, ICD2Vec is applicable to diverse quantitative analyses using ICD codes in biomedical research.
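The mapping of a composite vector to its most similar ICD code can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-dimensional vectors are invented toy values rather than ICD2Vec embeddings, the two "symptom" vectors are hypothetical, and the helper names (`cosine_similarity`, `composite_vector`, `most_similar_code`) do not come from the paper. It only shows the general idea of summing vectors and ranking ICD codes by cosine similarity.

```python
import math

# Toy 3-dimensional embeddings. The numbers are invented for illustration
# and are NOT taken from ICD2Vec.
TOY_VECTORS = {
    "I21": [0.9, 0.1, 0.0],  # acute myocardial infarction
    "I25": [0.8, 0.2, 0.1],  # chronic ischaemic heart disease
    "E11": [0.1, 0.9, 0.2],  # type 2 diabetes mellitus
    "F03": [0.0, 0.2, 0.9],  # unspecified dementia
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def composite_vector(vectors):
    """Element-wise sum of several vectors (simple vector arithmetic)."""
    return [sum(components) for components in zip(*vectors)]


def most_similar_code(query, code_vectors):
    """Return the ICD code whose vector has the highest cosine similarity."""
    return max(
        code_vectors,
        key=lambda code: cosine_similarity(query, code_vectors[code]),
    )


# Two hypothetical symptom vectors combined into one composite query.
symptom_a = [0.7, 0.0, 0.0]
symptom_b = [0.2, 0.1, 0.0]
query = composite_vector([symptom_a, symptom_b])
best_match = most_similar_code(query, TOY_VECTORS)
```

With these toy values the composite query points in almost the same direction as the vector stored for `I21`, so that code is returned. Real embeddings would have far more dimensions, and the choice of summation as the composition rule is itself an assumption of this sketch.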
