Word Embedding Clustering for Disease Named Entity Recognition

This paper reports a machine learning-based approach with word embedding features for the Disease Named Entity Recognition and Normalization subtask of the BioCreative V Chemical-Disease Relation (CDR) challenge task. First, we developed a feature extraction phase with the standard features used in current Named Entity Recognition (NER) systems. Then, we compared the word vectors and the word clusters generated by the Word2Vec tool in order to add the best of both to the feature set. For this purpose, we trained the Word2Vec models on Wikipedia and MedLine as corpora. Our results suggest that the use of word clusters improves the F-score by 28% in disease mention recognition and increases precision by almost 49% in the normalization task over the baseline system provided by the organizers.
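To illustrate how a word cluster can serve as an NER feature, the following is a minimal sketch rather than the system described here. It assumes k-means as the clustering method and uses a handful of hand-made two-dimensional vectors in place of real Word2Vec embeddings; the names `TOY_VECTORS`, `kmeans`, and `token_features` are hypothetical and chosen only for this example.

```python
import math

# Hypothetical 2-D "embeddings" (assumption for illustration only).
# Real Word2Vec vectors are learned from a corpus and have many more dimensions.
TOY_VECTORS = {
    "diabetes": (0.90, 0.10),
    "asthma": (0.80, 0.20),
    "hepatitis": (0.85, 0.15),
    "aspirin": (0.10, 0.90),
    "ibuprofen": (0.20, 0.80),
    "metformin": (0.15, 0.85),
}


def kmeans(vectors, k, iterations=10):
    """Plain k-means over a word -> vector mapping.

    The first k words in sorted order seed the centroids, which keeps the
    result deterministic. Returns a word -> cluster id mapping.
    """
    words = sorted(vectors)
    centroids = [vectors[w] for w in words[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: attach each word to its nearest centroid.
        for w in words:
            distances = [math.dist(vectors[w], c) for c in centroids]
            assignment[w] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [vectors[w] for w in words if assignment[w] == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment


def token_features(token, clusters):
    """Build a feature dict for one token, with the cluster id as a discrete feature."""
    word = token.lower()
    return {
        "lower": word,
        "is_capitalized": token[:1].isupper(),
        "cluster": clusters.get(word, -1),  # -1 marks out-of-vocabulary tokens
    }


clusters = kmeans(TOY_VECTORS, k=2)
print(token_features("Asthma", clusters))
```

The cluster id turns a dense vector into a single categorical value, which is convenient for sequence labelers that expect discrete features; words that share a cluster share the feature even if only one of them appeared in the annotated training data.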