Learning adaptive representations for entity recognition in the biomedical domain

Background: Named Entity Recognition is a common task in Natural Language Processing whose purpose is to recognize named entities in textual documents. Several systems address this task in the biomedical domain, based on Natural Language Processing techniques and Machine Learning algorithms. A crucial step in these applications is the choice of the representation used to describe the data. Several representations have been proposed in the literature. Some rely on strong domain knowledge and consist of features manually defined by domain experts. These representations usually describe the problem well, but they require considerable human effort and annotated data. General-purpose representations such as word embeddings, on the other hand, do not require human domain knowledge, but they may be too general for a specific task.

Results: This paper investigates methods to learn the best representation directly from data by combining several knowledge-based representations and word embeddings. Two mechanisms are considered for the combination: neural networks and Multiple Kernel Learning. To this end, we use a hybrid architecture for biomedical entity recognition that integrates dictionary look-up (also known as gazetteers) with machine learning techniques. Results on the CRAFT corpus clearly show the benefits of the proposed algorithm in terms of F1 score.

Conclusions: Our experiments show that the principled combination of general, domain-specific, word-level, and character-level representations improves the performance of entity recognition. We also discuss the contribution of each representation to the final solution.
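The dictionary look-up (gazetteer) step mentioned in the abstract can be illustrated with a minimal sketch. This is not the system used in the paper: the greedy longest-match strategy, the function names, the toy terminology, and the entity labels are all assumptions chosen for illustration. It shows how matched spans could be turned into per-token tags that a downstream machine learning model might consume alongside other representations.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def build_gazetteer(entries: Dict[str, str]) -> Dict[Tuple[str, ...], str]:
    """Map lower-cased, whitespace-tokenized terms to an entity label."""
    return {tuple(term.lower().split()): label for term, label in entries.items()}


def dictionary_lookup(tokens: List[str],
                      gazetteer: Dict[Tuple[str, ...], str]) -> List[Span]:
    """Greedy longest-match look-up (an assumed strategy, not the paper's)."""
    max_len = max((len(key) for key in gazetteer), default=0)
    lowered = [t.lower() for t in tokens]
    spans: List[Span] = []
    i = 0
    while i < len(lowered):
        match = None
        # Try the longest candidate first, then shrink the window.
        for n in range(min(max_len, len(lowered) - i), 0, -1):
            label = gazetteer.get(tuple(lowered[i:i + n]))
            if label is not None:
                match = (i, i + n, label)
                break
        if match is not None:
            spans.append(match)
            i = match[1]  # skip past the matched span
        else:
            i += 1
    return spans


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert spans to BIO tags, one per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label
        for j in range(start + 1, end):
            tags[j] = "I-" + label
    return tags


# Hypothetical toy terminology; labels are illustrative only.
TOY_TERMS = {"red blood cell": "CELL", "blood": "TISSUE", "glucose": "CHEMICAL"}

tokens = "Glucose uptake in red blood cell samples".split()
gazetteer = build_gazetteer(TOY_TERMS)
spans = dictionary_lookup(tokens, gazetteer)
tags = spans_to_bio(len(tokens), spans)
print(list(zip(tokens, tags)))
```

In a hybrid setting, tags like these would typically be encoded as one more feature per token rather than taken as final predictions, so that the learning algorithm can weigh dictionary evidence against word- and character-level representations.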
