Text mining and protein annotations: the construction and use of protein description sentences.

Existing biological knowledge stored as structured database records has been extracted manually by database curators analyzing the scientific literature. Most of this information was derived from sentences that describe biologically relevant aspects of genes and gene products. We introduce the Protein description sentence (Prodisen) corpus, a resource for the automatic identification and construction of text-based protein and gene description records using information extraction and text classification techniques. We propose basic guidelines and criteria for constructing a text corpus of functional descriptions of genes and proteins, and we present the steps used to build the corpus together with its features. We also explore some potential applications of the Prodisen corpus in biomedical text mining and report the results obtained.
