Comparative genomics using data mining tools

We have analysed the genomes of representatives of the three kingdoms of life, namely archaea, eubacteria and eukaryota, using data mining tools based on compositional analyses of the protein sequences. The representatives chosen for this analysis were Methanococcus jannaschii, Haemophilus influenzae and Saccharomyces cerevisiae. We have identified the features that are common to, and those that differ between, the three genomes in their protein evolution patterns. M. jannaschii was found to have a greater number of proteins rich in charged amino acids, whereas S. cerevisiae was found to have a greater number of hydrophilic proteins. Despite these differences in intrinsic compositional characteristics between the proteins of the different genomes, we have also identified certain common characteristics. We carried out exploratory Principal Component Analysis of the multivariate data on the proteins of each organism in an effort to classify the proteins into clusters. Interestingly, we found that most of the proteins in each organism cluster closely together, but there are a few ‘outliers’. We focus on these outliers for functional investigation, which may help to reveal unique features of the biology of the respective organisms.
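The kind of analysis described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not list the variables used, so the sketch assumes each protein is described only by its 20 amino acid fractions, takes D, E, K and R as the charged residues, and flags outliers with an arbitrary cut-off on the first principal component. The leading axis is found by power iteration rather than a full eigendecomposition, purely to keep the example free of dependencies. All function names and the toy sequences in the demo are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CHARGED = set("DEKR")  # assumption: Asp, Glu, Lys and Arg count as charged


def composition(seq):
    """Return the fractions of the 20 standard amino acids in a sequence."""
    residues = [aa for aa in seq.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


def charged_fraction(seq):
    """Fraction of residues that belong to the assumed charged set."""
    comp = composition(seq)
    return sum(f for aa, f in zip(AMINO_ACIDS, comp) if aa in CHARGED)


def first_principal_component(rows, iterations=200):
    """Return (axis, scores) for the leading principal component.

    The axis is obtained by power iteration on the sample covariance
    matrix; scores are the projections of the centred rows onto it.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Composition rows sum to 1, so a uniform start vector lies in the
    # null space of the covariance matrix; start from a non-uniform one.
    v = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, scores


def outliers(names, scores, threshold=2.0):
    """Names whose score lies more than `threshold` standard deviations from 0."""
    sd = math.sqrt(sum(s * s for s in scores) / (len(scores) - 1))
    if sd == 0.0:
        return []
    return [name for name, s in zip(names, scores) if abs(s) > threshold * sd]


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    toy = {
        "p1": "ACDEFGHIKLMNPQRSTVWY",
        "p2": "ACDEFGHIKLMNPQRSTVWYA",
        "p3": "KKKKEEEEDDDDRRRRKKEE",
    }
    for name, seq in toy.items():
        print(name, round(charged_fraction(seq), 3))
```

With real data, each organism's predicted proteins would be passed through `composition`, and the scores from `first_principal_component` inspected per organism. A cut-off of two standard deviations is only a starting point: with very few proteins no score can exceed it by much, so the threshold should be chosen with the sample size in mind.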
