Gene3D: comprehensive structural and functional annotation of genomes

Gene3D provides comprehensive structural and functional annotation of most available protein sequences, including the UniProt, RefSeq and Integr8 resources. The main structural annotation is generated through scanning these sequences against the CATH structural domain database profile-HMM library. CATH is a database of manually derived PDB-based structural domains, placed within a hierarchy reflecting topology, homology and conservation and is able to infer more ancient and divergent homology relationships than sequence-based approaches. This data is supplemented with Pfam-A, other non-domain structural predictions (i.e. coiled coils) and experimental data from UniProt. In order to enhance the investigations possible with this data, we have also incorporated a variety of protein annotation resources, including protein–protein interaction data, GO functional assignments, KEGG pathways, FUNCAT functional descriptions and links to microarray expression data. All of this data can be accessed through a newly re-designed website that has a focus on flexibility and clarity, with searches that can be restricted to a single genome or across the entire sequence database. Currently Gene3D contains over 3.5 million domain assignments for nearly 5 million proteins including 527 completed genomes. This is available at: http://gene3d.biochem.ucl.ac.uk/

[1]  Tatiana A. Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[2]  Ingmar Reuter,et al.  Integr8 and Genome Reviews: integrated views of complete genomes and proteomes , 2004, Nucleic Acids Res..

[3]  J. Skolnick,et al.  How well is enzyme function conserved as a function of pairwise sequence identity? , 2003, Journal of molecular biology.

[4]  B. Rost Enzyme function less conserved than anticipated. , 2002, Journal of molecular biology.

[5]  Tatiana Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[6]  Hongfang Liu,et al.  BioThesaurus: a web-based thesaurus of protein and gene names , 2006, Bioinform..

[7]  Christine A. Orengo,et al.  Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint , 2007, BMC Bioinformatics.

[8]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[9]  Haruki Nakamura,et al.  Announcing the worldwide Protein Data Bank , 2003, Nature Structural Biology.

[10]  Hans-Werner Mewes,et al.  MPact: the MIPS protein interaction resource on yeast , 2005, Nucleic Acids Res..

[11]  Delbert Dueck,et al.  Clustering by Passing Messages Between Data Points , 2007, Science.

[12]  Thomas Rattei,et al.  SIMAP - The similarity matrix of proteins , 2005, ECCB/JBI.

[13]  Adam Godzik,et al.  Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences , 2006, Bioinform..

[14]  Gabriele Ausiello,et al.  MINT: the Molecular INTeraction database , 2006, Nucleic Acids Res..

[15]  Helen E. Parkinson,et al.  ArrayExpress—a public database of microarray experiments and gene expression profiles , 2006, Nucleic Acids Res..

[16]  C. Chothia,et al.  Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. , 2001, Journal of molecular biology.

[17]  Robert D. Finn,et al.  New developments in the InterPro database , 2007, Nucleic Acids Res..

[18]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[19]  Robert D. Finn,et al.  Pfam: clans, web tools and services , 2005, Nucleic Acids Res..

[20]  David A. Lee,et al.  Identification and distribution of protein families in 120 completed genomes using Gene3D , 2005, Proteins.

[21]  Kiyoko F. Aoki-Kinoshita,et al.  From genomics to chemical genomics: new developments in KEGG , 2005, Nucleic Acids Res..

[22]  Hagen Blankenburg,et al.  The implications of alternative splicing in the ENCODE protein complement , 2007, Proceedings of the National Academy of Sciences.

[23]  Frances M. G. Pearl,et al.  The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution , 2006, Nucleic Acids Res..

[24]  Darren A. Natale,et al.  The COG database: an updated version includes eukaryotes , 2003, BMC Bioinformatics.

[25]  V. McKusick Mendelian inheritance in man , 1971 .

[26]  H. Mewes,et al.  The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. , 2004, Nucleic acids research.

[27]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[28]  Robert D. Finn,et al.  ProServer: a simple, extensible Perl DAS server , 2007, Bioinform..

[29]  Andrew C. R. Martin Databases and ontologies Mapping PDB chains to UniProtKB entries , 2005 .

[30]  Christine A. Orengo,et al.  Predicting Protein Function with Hierarchical Phylogenetic Profiles: The Gene3D Phylo-Tuner Method Applied to Eukaryotic Genomes , 2007, PLoS Comput. Biol..

[31]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[32]  J. Naylor,et al.  Mendelian inheritance in man: A catalog of human genes and genetic disorders , 1996 .

[33]  Y. Zhang,et al.  IntAct—open source resource for molecular interaction data , 2006, Nucleic Acids Res..