Whole Genome Protein Domain Analysis using a New Method for Domain Clustering

We present the results of a systematic analysis of protein domain shuffling in 17 completed microbial genomes. The analysis was performed with MKDOM Version 2, a completely new version of the domain clustering program MKDOM, based on recursive PSI-BLAST homology searches. It allows us to delineate the most frequent protein domain building blocks, and to determine which domains are found specifically in Bacteria, Archaea or yeast and which are shared between two or all three domains of life. The latter are good candidates for the basic protein building blocks underlying all forms of cellular life. Statistics on multi-domain proteins indicate that some organisms, such as Bacillus subtilis and Mycobacterium tuberculosis, contain an abnormally high number of large multi-domain proteins. We also provide examples of highly shuffled or circularly permuted domains. A WWW graphical interface for interactively browsing the domain arrangements of proteins in all 17 genomes is available at http://www.toulouse.inra.fr/prodomCG.html.
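As an illustration of the kingdom-level comparison described above, the following minimal sketch groups domain families by the set of kingdoms in which they occur. It is not the MKDOM 2 procedure: the function name `classify_by_kingdom`, the family identifiers, and the toy occurrence list are all hypothetical, and the input is assumed to be a list of (domain family, kingdom) pairs obtained after clustering.

```python
from collections import defaultdict


def classify_by_kingdom(occurrences):
    """Group domain families by the set of kingdoms they occur in.

    `occurrences` is an iterable of (family_id, kingdom) pairs, one per
    domain occurrence (hypothetical input format, assumed for this sketch).
    Returns a dict mapping a frozenset of kingdoms to the set of families
    found in exactly those kingdoms.
    """
    # First pass: collect the kingdoms in which each family is observed.
    kingdoms_by_family = defaultdict(set)
    for family, kingdom in occurrences:
        kingdoms_by_family[family].add(kingdom)

    # Second pass: invert the mapping so families with the same
    # kingdom distribution end up in the same group.
    groups = defaultdict(set)
    for family, kingdoms in kingdoms_by_family.items():
        groups[frozenset(kingdoms)].add(family)
    return dict(groups)


# Toy data: family identifiers are invented for illustration only.
occurrences = [
    ("FAM_A", "Bacteria"),
    ("FAM_A", "Archaea"),
    ("FAM_A", "yeast"),
    ("FAM_B", "Bacteria"),
    ("FAM_B", "Bacteria"),
    ("FAM_C", "Archaea"),
    ("FAM_C", "yeast"),
]

groups = classify_by_kingdom(occurrences)

# Families present in all three kingdoms are the candidate
# "universal" building blocks in the sense of the abstract.
universal = groups.get(frozenset({"Bacteria", "Archaea", "yeast"}), set())
bacteria_only = groups.get(frozenset({"Bacteria"}), set())
```

With this toy input, `universal` contains only `FAM_A` and `bacteria_only` contains only `FAM_B`; families shared between exactly two kingdoms (here `FAM_C`) fall into their own group.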
