Estimation of Disease Code from Electronic Patient Records

This paper proposes a method for classifying discharge summaries stored in a hospital information system. The method consists of the following four steps. First, a term matrix is constructed from the set of summaries by morphological analysis (RMeCab). Next, correspondence analysis is applied to the term matrix, assigning two-dimensional coordinates to each keyword and each concept; keywords are then ranked by the Euclidean distance between categories and keywords. Then, keywords are selected as attributes according to this rank, and training examples for the classifiers are generated. Finally, learning methods are applied to the training examples. Experimental validation shows that random forest achieved the best performance, with deep learning (a multilayer perceptron) second best.
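
The correspondence analysis and keyword ranking steps can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes a toy keyword-by-category count table, and the keywords, the three category codes, the counts, and the function names `correspondence_analysis` and `rank_keywords` are all hypothetical. Correspondence analysis is computed here in one standard way, through a singular value decomposition of the standardized residuals, with both keywords and categories placed in principal coordinates.

```python
import numpy as np

# Toy keyword-by-category counts (illustrative values, not from the paper).
keywords = ["fever", "cough", "fracture", "cast", "insulin", "glucose"]
categories = ["J18", "S72", "E11"]  # hypothetical disease codes
counts = np.array([
    [8, 1, 1],
    [9, 0, 1],
    [0, 10, 0],
    [1, 9, 0],
    [1, 0, 9],
    [0, 1, 9],
], dtype=float)


def correspondence_analysis(table, n_dims=2):
    """Return principal coordinates of rows and columns in n_dims dimensions."""
    P = table / table.sum()              # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    row_xy = (U * sigma) / np.sqrt(r)[:, None]
    col_xy = (Vt.T * sigma) / np.sqrt(c)[:, None]
    return row_xy[:, :n_dims], col_xy[:, :n_dims]


def rank_keywords(keyword_xy, category_xy):
    """Rank keywords by Euclidean distance to each category.

    Returns (order, dist): dist[i, j] is the distance from keyword i to
    category j, and order[:, j] lists keyword indices nearest-first.
    """
    diff = keyword_xy[:, None, :] - category_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    order = np.argsort(dist, axis=0)
    return order, dist


keyword_xy, category_xy = correspondence_analysis(counts)
order, dist = rank_keywords(keyword_xy, category_xy)

# Keep the top_k nearest keywords per category as candidate attributes.
top_k = 2
selected = {
    cat: [keywords[i] for i in order[:top_k, j]]
    for j, cat in enumerate(categories)
}
```

In this sketch, `selected` maps each category to its `top_k` nearest keywords. The union of these keywords would then serve as the attribute set from which training examples are built, with `top_k` as a tunable cutoff.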