Extracting and Normalizing Gene/Protein Mentions with the Flexible and Trainable Moara Java Library

Gene/protein recognition and normalization are important prerequisite steps for many biological text mining tasks. Although considerable effort has been devoted to these problems and effective solutions have been reported, easily integrated tools that perform these tasks remain scarce. We therefore propose Moara, a Java library that implements gene/protein recognition and normalization based on machine learning approaches. The recognition step can be trained with additional documents, and new organisms can be added to the normalization step. The novelty of the methodology used in Moara lies in the design of a system that is not tailored to a specific organism and therefore needs no organism-dependent tuning of its algorithms or dictionaries. Moara can be used either as a standalone application or as a component of a text mining system, and it is available at http://moara.dacya.ucm.es
