Effects of Semantic Features on Machine Learning-Based Drug Name Recognition Systems: Word Embeddings vs. Manually Constructed Dictionaries

Semantic features are very important for machine learning-based drug name recognition (DNR) systems. The semantic features used in most DNR systems are based on drug dictionaries manually constructed by experts. Building large-scale drug dictionaries is time-consuming, and adding new drugs to existing dictionaries as soon as they are developed is also a challenge. In recent years, word embeddings, which contain rich latent semantic information about words, have been widely used to improve the performance of various natural language processing tasks. However, they have not been used in DNR systems. Compared to semantic features based on drug dictionaries, the advantage of word embeddings is that they are learned without supervision. In this paper, we investigate the effect of semantic features based on word embeddings on DNR and compare them with semantic features based on three drug dictionaries. We propose a conditional random fields (CRF)-based system for DNR. The skip-gram model, an unsupervised algorithm, is used to induce word embeddings from about 17.3 gigabytes (GB) of unlabeled biomedical text collected from MEDLINE (National Library of Medicine, Bethesda, MD, USA). The system is evaluated on the drug-drug interaction extraction (DDIExtraction) 2013 corpus. Experimental results show that word embeddings significantly improve the performance of the DNR system and are competitive with semantic features based on drug dictionaries. The F-score improves by 2.92 percentage points when word embeddings are added to the baseline system, a gain comparable to the improvements from semantic features based on drug dictionaries. Furthermore, word embeddings are complementary to the semantic features based on drug dictionaries. When both word embeddings and semantic features based on drug dictionaries are added, the system achieves the best performance with an F-score of 78.37%, which outperforms the best system of the DDIExtraction 2013 challenge by 6.87 percentage points.
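
To make the comparison concrete, the following is a minimal sketch of how a dictionary-based feature and an embedding-based feature might sit side by side in the per-token feature set passed to a CRF. It is not the authors' implementation: the abstract does not say how embeddings are converted into features, so the sketch assumes each word has already been mapped to a cluster ID derived from its embedding. The names `token_features`, `sentence_features`, `drug_dict`, and `word_cluster`, along with the example cluster label, are hypothetical.

```python
from typing import Dict, List, Set, Union

Feature = Union[str, bool]


def token_features(
    tokens: List[str],
    i: int,
    drug_dict: Set[str],
    word_cluster: Dict[str, str],
) -> Dict[str, Feature]:
    """Build a feature dict for tokens[i] (illustrative feature set only)."""
    word = tokens[i]
    lower = word.lower()
    feats: Dict[str, Feature] = {
        # Simple lexical features standing in for a baseline system.
        "word.lower": lower,
        "suffix3": lower[-3:],
        "is_capitalized": word[:1].isupper(),
        # Dictionary-based semantic feature: exact lookup in a hand-built lexicon.
        "in_drug_dict": lower in drug_dict,
        # Embedding-based semantic feature (assumption): a cluster ID computed
        # offline from unsupervised word embeddings; "UNK" if the word is unseen.
        "emb_cluster": word_cluster.get(lower, "UNK"),
    }
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    return feats


def sentence_features(
    tokens: List[str],
    drug_dict: Set[str],
    word_cluster: Dict[str, str],
) -> List[Dict[str, Feature]]:
    """Feature dicts for every token in a sentence, in order."""
    return [
        token_features(tokens, i, drug_dict, word_cluster)
        for i in range(len(tokens))
    ]


if __name__ == "__main__":
    tokens = ["Aspirin", "may", "potentiate", "warfarin"]
    drug_dict = {"aspirin"}                       # toy lexicon, missing "warfarin"
    word_cluster = {"aspirin": "C17", "warfarin": "C17"}  # toy cluster IDs
    for tok, feats in zip(tokens, sentence_features(tokens, drug_dict, word_cluster)):
        print(tok, feats["in_drug_dict"], feats["emb_cluster"])
```

In this toy input, "warfarin" is absent from the dictionary but shares an embedding cluster with "aspirin", which illustrates how the two feature types could complement each other. A list of feature dicts per sentence is the input shape accepted by several CRF toolkits, though the toolkit used in the paper is not stated in the abstract.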
