A study of damp-heat syndrome classification using Word2vec and TF-IDF

As public concern about health grows, there is increasing demand for assessing a person's health from medical records. Most previous disease analysis research was conducted on structured datasets, which are expensive to obtain and usually ignore the relationships between different symptoms. In this paper, we propose a novel model based on Word2vec and Term Frequency-Inverse Document Frequency (TF-IDF) that detects damp-heat syndrome directly from unstructured records. First, we use the ICTCLAS system, combined with a corpus collected in the field of Traditional Chinese Medicine (TCM), to segment the clinical records into words. Second, we train word vectors with the Word2vec tool. Third, we construct a representation vector for each record from the word vectors and the TF-IDF values. We call this record representation method Word2vec+TF-IDF. To verify its effectiveness, we compared it with other text representation methods under four different classifiers. The experiments were conducted on a dataset collected from over 10 Chinese Medicine hospitals. The results show that our model performs better than state-of-the-art methods such as LSA and Doc2vec.
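The abstract does not state exactly how the word vectors and TF-IDF values are combined, so the following is only a minimal sketch under one common assumption: each record vector is the TF-IDF-weighted average of the vectors of its words. The function names, the toy tokens, and the two-dimensional word vectors are all hypothetical placeholders; in the pipeline described above, the tokens would come from ICTCLAS segmentation and the vectors from a trained Word2vec model.

```python
import math
from collections import Counter


def tfidf_weights(records):
    """Return one {word: tf-idf weight} dict per tokenized record."""
    n_records = len(records)
    # Document frequency: number of records that contain each word.
    doc_freq = Counter(word for rec in records for word in set(rec))
    weights = []
    for rec in records:
        term_freq = Counter(rec)
        weights.append({
            word: (count / len(rec)) * math.log(n_records / doc_freq[word])
            for word, count in term_freq.items()
        })
    return weights


def record_vector(weights, word_vectors, dim):
    """Assumed combination rule: TF-IDF-weighted average of word vectors."""
    vec = [0.0] * dim
    total = 0.0
    for word, weight in weights.items():
        if word not in word_vectors:
            continue  # skip words that have no trained vector
        for i, value in enumerate(word_vectors[word]):
            vec[i] += weight * value
        total += weight
    # Fall back to the zero vector when no word carries any weight.
    return [v / total for v in vec] if total > 0 else vec


# Toy segmented records (hypothetical tokens, not real clinical data).
records = [
    ["thirst", "yellow_tongue", "fatigue"],
    ["fatigue", "cough"],
    ["thirst", "fatigue", "thirst"],
]

# Toy 2-dimensional word vectors standing in for Word2vec output.
word_vectors = {
    "thirst": [1.0, 0.0],
    "yellow_tongue": [0.0, 1.0],
    "fatigue": [0.5, 0.5],
    "cough": [-1.0, 0.5],
}

weights = tfidf_weights(records)
vectors = [record_vector(w, word_vectors, dim=2) for w in weights]
```

In this toy data, "fatigue" appears in every record, so its inverse document frequency is zero and it contributes nothing to any record vector. That is the intended effect of the weighting: words shared by all records carry little discriminative information. The resulting `vectors` could then be passed to any standard classifier.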
