Analysis of protein domain families in Caenorhabditis elegans.

The Caenorhabditis elegans genome sequencing project has completed over half of this nematode's 100-Mb genome. Proteins predicted in the finished sequence have been compiled and released in the database Wormpep. Presented here is a comprehensive analysis of protein domain families in Wormpep 11, which comprises 7299 proteins. The relative abundance of common protein domain families was determined by comparing all Wormpep proteins to the Pfam collection of protein families, which is based on recognition by hidden Markov models. This analysis also identified a number of previously unannotated domains. To investigate new, apparently nematode-specific protein families, Wormpep was clustered into domain families on the basis of sequence similarity using the Domainer program. The largest clusters that lacked clear homology to proteins outside Nematoda were analyzed in further detail, after which some could be assigned a putative function. We also compared all proteins in Wormpep 11 to proteins in the human, Saccharomyces cerevisiae, and Haemophilus influenzae genomes. Among the results are the estimate that over two-thirds of the currently known human proteins are likely to have a homologue in the whole C. elegans genome, and the finding that a significant number of proteins that are well conserved between C. elegans and H. influenzae are not found in S. cerevisiae.
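
The abundance count described above can be illustrated with a minimal sketch. It assumes the domain hits are already available as (protein, family) pairs and that a family is counted once per protein; both the input format and this counting rule are assumptions for illustration, not the procedure used in the analysis. The function name `family_abundance` and the identifiers in the example are hypothetical.

```python
from collections import Counter


def family_abundance(hits):
    """Rank families by the number of distinct proteins they occur in.

    hits: iterable of (protein_id, family_name) pairs, for example parsed
    from the output of a profile-HMM search (hypothetical input format).
    A family that matches the same protein several times is counted once.
    """
    seen = set()          # (protein, family) pairs already counted
    counts = Counter()    # family name -> number of distinct proteins
    for protein_id, family in hits:
        if (protein_id, family) not in seen:
            seen.add((protein_id, family))
            counts[family] += 1
    # Most abundant families first
    return counts.most_common()


# Invented example data: P1 carries two copies of the famA domain.
example_hits = [
    ("P1", "famA"),
    ("P1", "famA"),
    ("P2", "famA"),
    ("P2", "famB"),
]
ranking = family_abundance(example_hits)
```

Counting distinct proteins rather than individual domain copies keeps multi-copy proteins from dominating the ranking; dropping the `seen` check would count domain occurrences instead.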
