Word Embedding for the French Natural Language in Health Care: Comparative Study

Background: Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, no formal evaluation and comparison have been made of the ability of each of the 3 most widely used unsupervised implementations (Word2Vec, GloVe, and FastText) to preserve the semantic similarities between words when trained on the same dataset.

Objective: The aim of this study was to compare embedding methods trained on a corpus of French health-related documents produced in a professional context. The best-performing method will then be used to develop a new semantic annotator.

Methods: Unsupervised embedding models were trained on 641,279 documents originating from the Rouen University Hospital. These data are unstructured and cover a wide range of documents produced in a clinical setting (discharge summaries, procedure reports, and prescriptions). In total, 4 rated evaluation tasks were defined (cosine similarity, odd one out, analogy-based operations, and formal human evaluation) and applied to each model, along with embedding visualization (see the sketch after the abstract).

Results: Word2Vec had the highest score on 3 out of 4 rated tasks (analogy-based operations, odd one out, and human validation), particularly with the skip-gram architecture.

Conclusions: Although this implementation preserved semantic properties best, each model has its own strengths and weaknesses, such as the very short training time of GloVe or the conservation of morphological similarity observed with FastText. The models and test sets produced by this study will be the first to be made publicly available through a graphical interface to help advance French biomedical research.
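The following is a minimal sketch, not the authors' exact pipeline, of how two of the compared models can be trained and how three of the four rated tasks (cosine similarity, odd one out, and analogy-based operations) can be probed with the gensim library. The corpus, hyperparameters, and French clinical tokens below are illustrative assumptions only; GloVe training relies on a separate tool and is omitted here, although GloVe-format vectors can be loaded afterward with gensim's KeyedVectors.load_word2vec_format.

```python
from gensim.models import Word2Vec, FastText

# Assumed input: clinical documents already tokenized into lists of tokens.
# These two toy sentences stand in for the study's 641,279 hospital documents.
corpus = [
    ["le", "patient", "présente", "un", "diabète", "de", "type", "2"],
    ["prescription", "d'", "insuline", "après", "la", "consultation"],
]

# Word2Vec with the skip-gram architecture (sg=1), the variant reported to
# score highest on 3 of the 4 rated tasks.
w2v = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=10)

# FastText also learns character n-gram (subword) vectors, which is what
# drives the morphological similarity conservation noted in the conclusions.
ft = FastText(corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=10)

# Task 1 - cosine similarity between two word vectors.
print(w2v.wv.similarity("patient", "diabète"))

# Task 2 - odd one out: which word does not belong with the others?
print(w2v.wv.doesnt_match(["patient", "diabète", "insuline", "prescription"]))

# Task 3 - analogy-based operation:
# v("insuline") - v("diabète") + v("consultation") ≈ ?
print(w2v.wv.most_similar(positive=["insuline", "consultation"],
                          negative=["diabète"], topn=3))
```

The skip-gram variant is queried here because it is the configuration the abstract reports as best; on a corpus this small the resulting vectors are meaningless, and realistic use would require the full document collection and tuned hyperparameters.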
