Large-scale protein annotation through gene ontology.

Recent progress in genomic sequencing, computational biology, and ontology development has created an opportunity to investigate biological systems from a unique perspective: examining genomes and transcriptomes through the multiple, hierarchical structure of Gene Ontology (GO). We report the development of GO Engine, a computational platform for GO annotation, and an analysis of the resulting GO annotations of human proteins. Protein annotation was centered on sequence homology with GO-annotated proteins and on protein domain analysis. Text information analysis and a multiparameter predictive tool for cellular localization were also used to increase annotation accuracy and to predict novel annotations. The majority of proteins corresponding to full-length mRNAs in GenBank, and the majority of proteins in the NR database (the nonredundant database of proteins), were annotated with one or more GO nodes in each of the three GO categories. The annotations of GenBank and SWISS-PROT proteins are available to the public at the GO Consortium web site.
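
As an illustration of the homology-centered step described above, the following minimal sketch transfers GO terms from annotated homologs to a query protein and then adds every ancestor term implied by the hierarchy. It is not the GO Engine implementation: the function names, the toy term identifiers, the hit format, and the E-value cutoff are all assumptions made for the example.

```python
# Hypothetical sketch of homology-based GO annotation transfer.
# Nothing here is taken from GO Engine; all names and values are illustrative.

# Toy ontology fragment (child -> parents). Identifiers are placeholders,
# not real GO accessions; the real GO is a much larger directed acyclic graph.
TOY_PARENTS = {
    "TOY:kinase": ["TOY:catalytic"],
    "TOY:catalytic": ["TOY:root"],
    "TOY:binding": ["TOY:root"],
    "TOY:root": [],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def transfer_annotations(hits, annotations, parents, max_evalue=1e-10):
    """Collect terms from homologs that pass an (assumed) E-value cutoff.

    hits: list of (subject_id, evalue) pairs from a sequence similarity search.
    annotations: mapping of subject_id -> list of directly annotated terms.
    """
    terms = set()
    for subject_id, evalue in hits:
        if evalue > max_evalue:
            continue  # hit too weak to justify transferring its annotations
        for term in annotations.get(subject_id, ()):
            terms.add(term)
            terms |= ancestors(term, parents)  # a term implies its ancestors
    return terms


# Example: one strong hit and one weak hit for a hypothetical query protein.
example_hits = [("P1", 1e-50), ("P2", 1e-3)]
example_annotations = {"P1": ["TOY:kinase"], "P2": ["TOY:binding"]}
predicted = transfer_annotations(example_hits, example_annotations, TOY_PARENTS)
```

Propagating each transferred term to its ancestors reflects the hierarchical structure of GO: a protein annotated with a specific node is implicitly annotated with every more general node above it. A fuller pipeline would combine this with the domain, text, and localization evidence mentioned in the abstract.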
