Snomed2Vec: Random Walk and Poincaré Embeddings of a Clinical Knowledge Base for Healthcare Analytics

Representation learning methods that transform encoded data (e.g., diagnosis and drug codes) into continuous vector spaces (i.e., vector embeddings) are critical for the application of deep learning in healthcare. Initial work in this area explored the use of variants of the word2vec algorithm to learn embeddings for medical concepts from electronic health records or medical claims datasets. We propose learning embeddings for medical concepts by using graph-based representation learning methods on SNOMED-CT, a widely popular knowledge graph in the healthcare domain with numerous operational and research applications. Current work presents an empirical analysis of various embedding methods, including the evaluation of their performance on multiple tasks of biomedical relevance (node classification, link prediction, and patient state prediction). Our results show that concept embeddings derived from the SNOMED-CT knowledge graph significantly outperform state-of-the-art embeddings, showing 5-6x improvement in ``concept similarity" and 6-20\% improvement in patient diagnosis.

[1]  Petr Sojka,et al.  Software Framework for Topic Modelling with Large Corpora , 2010 .

[2]  Tianxi Cai,et al.  Clinical Concept Embeddings Learned from Massive Sources of Medical Data , 2018, ArXiv.

[3]  Douwe Kiela,et al.  Poincaré Embeddings for Learning Hierarchical Representations , 2017, NIPS.

[4]  Le Song,et al.  GRAM: Graph-based Attention Model for Healthcare Representation Learning , 2016, KDD.

[5]  Jure Leskovec,et al.  node2vec: Scalable Feature Learning for Networks , 2016, KDD.

[6]  Andrew M. Dai,et al.  Embedding Text in Hyperbolic Spaces , 2018, TextGraphs@NAACL-HLT.

[7]  Steven Skiena,et al.  DeepWalk: online learning of social representations , 2014, KDD.

[8]  Yixin Chen,et al.  Link Prediction Based on Graph Neural Networks , 2018, NeurIPS.

[9]  Matthias Samwald,et al.  Applying deep learning techniques on medical corpora from the World Wide Web: a prototypical system and evaluation , 2015, ArXiv.

[10]  Guido Zuccon,et al.  Medical Semantic Similarity with a Neural Language Model , 2014, CIKM.

[11]  A. McCray The UMLS Semantic Network. , 1989 .

[12]  Jimeng Sun,et al.  Multi-layer Representation Learning for Medical Concepts , 2016, KDD.

[13]  Nitesh V. Chawla,et al.  metapath2vec: Scalable Representation Learning for Heterogeneous Networks , 2017, KDD.

[14]  Peter Szolovits,et al.  MIMIC-III, a freely accessible critical care database , 2016, Scientific Data.

[15]  Kevin Donnelly,et al.  SNOMED-CT: The advanced terminology and coding system for eHealth. , 2006, Studies in health technology and informatics.