Gimli: open source and high-performance biomedical name recognition

BackgroundAutomatic recognition of biomedical names is an essential task in biomedical information extraction, presenting several complex and unsolved challenges. In recent years, various solutions have been implemented to tackle this problem. However, limitations regarding system characteristics, customization and usability still hinder their wider application outside text mining research.ResultsWe present Gimli, an open-source, state-of-the-art tool for automatic recognition of biomedical names. Gimli includes an extended set of implemented and user-selectable features, such as orthographic, morphological, linguistic-based, conjunctions and dictionary-based. A simple and fast method to combine different trained models is also provided. Gimli achieves an F-measure of 87.17% on GENETAG and 72.23% on JNLPBA corpus, significantly outperforming existing open-source solutions.ConclusionsGimli is an off-the-shelf, ready to use tool for named-entity recognition, providing trained and optimized models for recognition of biomedical entities from scientific text. It can be used as a command line tool, offering full functionality, including training of new models and customization of the feature set and model parameters through a configuration file. Advanced users can integrate Gimli in their text mining workflows through the provided library, and extend or adapt its functionalities. Based on the underlying system characteristics and functionality, both for final users and developers, and on the reported performance results, we believe that Gimli is a state-of-the-art solution for biomedical NER, contributing to faster and better research in the field. Gimli is freely available at http://bioinformatics.ua.pt/gimli.

[1]  Richard Tzong-Han Tsai,et al.  Overview of BioCreative II gene mention recognition , 2008, Genome Biology.

[2]  Chun-Nan Hsu,et al.  Integrating high dimensional bi-directional parsing models for gene mention tagging , 2008, ISMB.

[3]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[4]  Gary Geunbae Lee,et al.  POSBIOTM-NER in the Shared Task of BioNLP/NLPBA2004 , 2004, NLPBA/BioNLP.

[5]  Hanna M. Wallach,et al.  Conditional Random Fields: An Introduction , 2004 .

[6]  Graciela Gonzalez,et al.  BANNER: An Executable Survey of Advances in Biomedical Named Entity Recognition , 2007, Pacific Symposium on Biocomputing.

[7]  Gideon S. Mann,et al.  Generalized Expectation Criteria for Semi-Supervised Learning of Conditional Random Fields , 2008, ACL.

[8]  Sophia Ananiadou,et al.  Developing a Robust Part-of-Speech Tagger for Biomedical Text , 2005, Panhellenic Conference on Informatics.

[9]  Guilherme Hoefel,et al.  Learning a two-stage SVM/CRF sequence classifier , 2008, CIKM '08.

[10]  Michael Collins,et al.  Ranking Algorithms for Named Entity Extraction: Boosting and the VotedPerceptron , 2002, ACL.

[11]  Nigel Collier,et al.  Introduction to the Bio-entity Recognition Task at JNLPBA , 2004, NLPBA/BioNLP.

[12]  Martijn J. Schuemie,et al.  Evaluation of techniques for increasing recall in a dictionary approach to gene and protein name identification , 2007, J. Biomed. Informatics.

[13]  Mariana L. Neves,et al.  CBR-Tagger: a case-based reasoning approach to the gene/protein mention problem , 2008, BioNLP.

[14]  Burr Settles,et al.  ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text , 2005 .

[15]  Jian Su,et al.  Recognizing Names in Biomedical Texts: a Machine Learning Approach , 2004 .

[16]  Satoshi Sekine,et al.  A survey of named entity recognition and classification , 2007 .

[17]  Myung-Gil Jang,et al.  Fast Training of Structured SVM Using Fixed-Threshold Sequential Minimal Optimization , 2009 .

[18]  Jun'ichi Tsujii,et al.  GENIA corpus - a semantically annotated corpus for bio-textmining , 2003, ISMB.

[19]  Lorraine K. Tanabe,et al.  GENETAG: a tagged corpus for gene/protein named entity recognition , 2005, BMC Bioinformatics.

[20]  Jun'ichi Tsujii,et al.  Dependency Parsing and Domain Adaptation with LR Models and Parser Ensembles , 2007, EMNLP.

[21]  Dietrich Rebholz-Schuhmann,et al.  BioLexicon: A Lexical Resource for the Biology Domain , 2008, SMBM 2008.

[22]  Wen-Lian Hsu,et al.  NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition , 2006, BMC Bioinformatics.

[23]  William W. Cohen,et al.  Semi-Markov Conditional Random Fields for Information Extraction , 2004, NIPS.

[24]  Man Lung Yiu,et al.  Group-by skyline query processing in relational engines , 2009, CIKM.

[25]  Rie Kubota Ando,et al.  BioCreative II Gene Mention Tagging System at IBM Watson , 2007 .

[26]  Malvina Nissim,et al.  Exploiting Context for Biomedical Entity Recognition: From Syntax to the Web , 2004, NLPBA/BioNLP.

[27]  Hongfang Liu,et al.  BioThesaurus: a web-based thesaurus of protein and gene names , 2006, Bioinform..

[28]  Dietrich Rebholz-Schuhmann,et al.  Harmonization of gene/protein annotations: towards a gold standard MEDLINE , 2012, Bioinform..

[29]  Andrew McCallum,et al.  Efficiently Inducing Features of Conditional Random Fields , 2002, UAI.

[30]  Cheng-Ju Kuo,et al.  Rich Feature Set, Unification of Bidirectional Parsing and Dictionary Filtering for High F-Score Gene Mention Tagging. , 2007 .

[31]  Jian Su,et al.  Recognition of protein/gene names from text using an ensemble of classifiers , 2005, BMC Bioinformatics.

[32]  Marti A. Hearst,et al.  A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text , 2002, Pacific Symposium on Biocomputing.