Categorization of Patient Diseases for Chinese Electronic Health Record Analysis: A Case Study

Electronic health record (EHR) analysis has become an increasingly important application area for machine learning and text mining algorithms, which aim to leverage the full potential of big data to improve human health care. In many of our Chinese EHR analysis applications, it is essential to categorize patients’ diseases according to the Chinese national medical coding standard. In this paper, we develop NLP and machine learning algorithms to automatically categorize each patient’s diseases into one or more categories. We treat each patient’s disease description as a document. Likewise, for each disease category, we take its description in the medical coding standard as a document. Given the characteristics of our data, we formulate the categorization task as an unsupervised classification problem and solve it with the nearest neighbor (NN) algorithm, comparing different vector representations of the documents. Experimental results show that averaged word2vec word embeddings work best, with very promising classification performance.
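To make the formulation concrete, the following is a minimal sketch of nearest neighbor categorization over averaged word embeddings. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the three-dimensional vectors in `TOY_EMBEDDINGS` are hand-picked stand-ins for trained word2vec embeddings, the category names and token lists are hypothetical, the documents are assumed to be already segmented into words (Chinese text would first need a word segmenter), and cosine similarity is assumed as the proximity measure.

```python
import math

# Hypothetical toy vectors standing in for trained word2vec embeddings.
TOY_EMBEDDINGS = {
    "heart": [0.9, 0.1, 0.0],
    "cardiac": [0.8, 0.2, 0.0],
    "failure": [0.5, 0.5, 0.1],
    "lung": [0.0, 0.9, 0.1],
    "pneumonia": [0.1, 0.8, 0.2],
    "infection": [0.2, 0.6, 0.4],
    "diabetes": [0.0, 0.1, 0.9],
    "glucose": [0.1, 0.0, 0.8],
}

# Hypothetical category documents; in the paper's setting these would be
# the category descriptions from the medical coding standard.
CATEGORY_DOCS = {
    "circulatory": ["heart", "cardiac", "failure"],
    "respiratory": ["lung", "pneumonia", "infection"],
    "endocrine": ["diabetes", "glucose"],
}


def average_embedding(tokens, embeddings):
    """Average the vectors of all in-vocabulary tokens; None if there are none."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_categories(patient_tokens, category_docs, embeddings, top_k=1):
    """Return the top_k category names closest to the patient document."""
    query = average_embedding(patient_tokens, embeddings)
    if query is None:
        return []
    scored = []
    for name, tokens in category_docs.items():
        vec = average_embedding(tokens, embeddings)
        if vec is not None:
            scored.append((cosine(query, vec), name))
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


if __name__ == "__main__":
    print(nearest_categories(["cardiac", "failure"], CATEGORY_DOCS, TOY_EMBEDDINGS))
```

Because no labeled training examples are involved, the category descriptions themselves act as the reference points. Raising `top_k` (or, as an alternative design, thresholding the similarity score) is one way to assign a patient to more than one category.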