Combining gene sequence similarity and textual information for gene function annotation in the literature

Annotating the functions of genes and proteins is an essential step in genome analysis. Information extraction techniques have been proposed to obtain function information for genes and proteins from the biomedical literature. However, most of these techniques perform unsatisfactorily because concepts are expressed in highly variable ways in biomedical text. This paper proposes a framework that improves gene function annotation from the literature by considering both the textual information in the literature and the functions of genes whose sequences are similar to that of a target gene. The framework collects three types of evidence: (i) textual information about gene functions, obtained by matching keywords of the gene functions; (ii) gene function information drawn from the known functions of genes with sequences similar to the target gene; and (iii) the prior probability that a gene function is associated with an arbitrary gene, estimated by mining the known gene functions in curated databases. A supervised learning method is used to obtain the weights for combining the three types of evidence and assigning appropriate Gene Ontology terms to target genes. Empirical studies on two testbeds demonstrate that combining sequence similarity scores, function prior probabilities and textual information improves the accuracy of gene function annotation from the literature. The F-measure scores obtained with the proposed framework are substantially higher than those of the solutions in prior research.
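The combination step can be illustrated with a minimal sketch. The abstract says only that a supervised learning method learns the weights, so the logistic model, the stochastic gradient training loop, the function names (`combine`, `train`) and the toy scores below are all assumptions made for illustration, not the framework's actual implementation or data.

```python
import math


def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def combine(weights, bias, evidence):
    """Weighted combination of evidence scores for one (gene, GO term) pair.

    evidence = (text_match, sequence_similarity, prior_probability);
    the ordering and scaling of these scores are hypothetical.
    """
    z = bias + sum(w * x for w, x in zip(weights, evidence))
    return sigmoid(z)


def train(examples, labels, lr=0.5, epochs=2000):
    """Fit the combination weights by stochastic gradient ascent on the
    logistic log-likelihood (an assumed choice of supervised learner)."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            err = y - combine(weights, bias, x)
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
            bias += lr * err
    return weights, bias


# Toy training data, invented for illustration only.
# Each row: (text_match, sequence_similarity, prior_probability).
examples = [
    (0.9, 0.8, 0.30),
    (0.7, 0.9, 0.10),
    (0.8, 0.2, 0.20),
    (0.1, 0.1, 0.05),
    (0.2, 0.3, 0.02),
    (0.0, 0.6, 0.01),
]
# 1 = the GO term is a correct annotation for the gene, 0 = it is not.
labels = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    weights, bias = train(examples, labels)
    candidate = (0.6, 0.7, 0.15)  # hypothetical unseen (gene, GO term) pair
    print("learned weights:", weights, "bias:", bias)
    print("combined score:", combine(weights, bias, candidate))
```

In a sketch like this, a Gene Ontology term would be assigned to the target gene when the combined score exceeds a chosen threshold; the threshold itself is another assumption that would need tuning on held-out data.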
