Functional Annotation of Genes through Statistical Analysis of Biomedical Articles

One of the most elaborate and important tasks in biology is the functional annotation of genes. Biologists have developed standardized, structured vocabularies, called bio-ontologies, to help them describe gene functions. A critical issue in assigning functions to genes is making use of the knowledge contained in published biomedical articles. The purpose of this paper is to present a unified and comprehensive statistical methodology for functionally annotating genes using the biomedical literature. Specifically, classification models are built using discriminant analysis, while validation, analysis and interpretation of the results are based on graphical methods and a range of performance metrics and techniques. The conclusions of the study are promising: the proposed methodology not only performs well in assigning functions to genes, but also provides useful and interpretable results regarding the discriminating power of certain keywords in the texts.
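As a rough illustration of the idea, and not the paper's actual pipeline, the sketch below fits a two-class Fisher linear discriminant to keyword counts taken from abstracts. The keywords ("kinase", "membrane"), the toy counts, the class labels "A" and "B", and the equal-prior decision threshold are all assumptions made for this example.

```python
# Minimal two-class linear discriminant on keyword counts (illustrative only).
# Each row is a hypothetical abstract: (count of "kinase", count of "membrane").
class_a = [(5, 1), (4, 0), (6, 2), (5, 0)]  # abstracts for genes with function A
class_b = [(1, 4), (0, 5), (2, 6), (1, 5)]  # abstracts for genes with function B


def mean(rows):
    """Per-feature mean of a list of 2-feature rows."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def pooled_cov(a, b):
    """Pooled within-class covariance matrix (2x2) of the two classes."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (a, b):
        m = mean(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(a) + len(b) - 2
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]


def fit_lda(a, b):
    """Return weights w = S^-1 (mean_a - mean_b) and the equal-prior threshold c."""
    ma, mb = mean(a), mean(b)
    s = pooled_cov(a, b)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(ma[0] + mb[0]) / 2, (ma[1] + mb[1]) / 2]
    c = w[0] * mid[0] + w[1] * mid[1]
    return w, c


def predict(w, c, x):
    """Assign class A when the discriminant score exceeds the threshold."""
    return "A" if w[0] * x[0] + w[1] * x[1] > c else "B"


w, c = fit_lda(class_a, class_b)
print("keyword weights:", w)
print("new abstract (6, 1) ->", predict(w, c, (6, 1)))
print("new abstract (0, 4) ->", predict(w, c, (0, 4)))
```

In this toy setting the sign and size of each weight indicate which class a keyword pushes an abstract toward, which is the sense in which a discriminant model can expose the discriminating power of individual keywords.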
