Correlation and prediction of gene expression level from amino acid and dipeptide composition of its protein

Background: A large number of papers have been published on the analysis of microarray data, with particular emphasis on normalization of data, detection of differentially expressed genes, clustering of genes and regulatory networks. On the other hand, there are only a few studies that use expression data to examine the relation between expression level and the composition of the nucleotide or protein sequence. There is a need to understand why particular genes/proteins are expressed more in particular conditions. In this study, we analyze 3468 genes of Saccharomyces cerevisiae obtained from Holstege et al. (1998) to understand the relationship between expression level and amino acid composition.

Results: We compute the correlation between the expression of a gene and the amino acid composition of its protein. Some residues (such as Ala, Gly, Arg and Val) show a significant positive correlation (r > 0.20) with gene expression, and others (such as Asp, Leu, Asn and Ser) show a negative correlation (r < -0.15). A significant negative correlation (r = -0.18) was also found between protein length and gene expression. These observations indicate a relationship between percent composition and gene expression level. We therefore developed a Support Vector Machine (SVM) based method for predicting the expression level of a gene from its protein sequence. In this method, the SVM is trained on proteins whose gene expression is known in a given condition. The trained SVM is then used to predict the gene expression of other proteins of the same organism in the same condition. A correlation coefficient of r = 0.70 was obtained between predicted and experimentally determined expression using amino acid composition, which improved to r = 0.72 when dipeptide composition was used instead. The method was evaluated using a 5-fold cross-validation test. We also demonstrate that amino acid composition information, along with gene expression data, can be used to improve the functional classification of proteins.

Conclusion: There is a correlation between gene expression and amino acid composition that can be used to predict the expression level of genes to a certain extent. A web server based on the above strategy has been developed for calculating the correlation between amino acid composition and gene expression and for predicting expression level: http://kiwi.postech.ac.kr/raghava/lgepred/. This server will allow users to study evolution from expression data.
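The two feature types named in the Results, and the per-residue correlation with expression, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`aa_composition`, `dipeptide_composition`, `pearson_r`, `residue_correlations`) are hypothetical, and the sequences and expression values in the example are made-up toy data, not values from the yeast data set. The SVM training step is not shown.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def aa_composition(seq):
    """Percent composition of each standard residue (20 values)."""
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    n = len(residues)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * residues.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Percent composition of each adjacent residue pair (400 values).

    Pairs containing a non-standard character are skipped (an assumption).
    """
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def residue_correlations(sequences, expression):
    """Correlation of each residue's percent composition with expression."""
    comps = [aa_composition(s) for s in sequences]
    return {
        a: pearson_r([c[i] for c in comps], expression)
        for i, a in enumerate(AMINO_ACIDS)
    }


if __name__ == "__main__":
    # Toy, made-up inputs purely for illustration.
    toy_sequences = ["AAGGV", "DLNSS", "AGRVL", "DDLLN"]
    toy_expression = [9.1, 1.2, 6.4, 0.8]
    for residue, r in residue_correlations(toy_sequences, toy_expression).items():
        print(residue, round(r, 2))
```

In a pipeline like the one described, the 20-value or 400-value vectors would serve as inputs to a regression model, and `pearson_r` could be reused to compare predicted and measured expression on held-out folds.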

[1]  G P S Raghava,et al.  GWFASTA: server for FASTA search in eukaryotic and microbial genomes. , 2002, BioTechniques.

[2]  Stilianos Arhondakis,et al.  Base composition and expression level of human genes. , 2004, Gene.

[3]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[4]  Takashi Gojobori,et al.  Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[5]  M Gerstein,et al.  Genome-wide analysis relating expression level with protein subcellular localization. , 2000, Trends in genetics : TIG.

[6]  G. Bernardi,et al.  The vertebrate genome: isochores and evolution. , 1993, Molecular biology and evolution.

[7]  M. Gerstein,et al.  Relationship between gene co-expression and probe localization on microarray slides , 2003, BMC Genomics.

[8]  Ronald W. Davis,et al.  Functional profiling of the Saccharomyces cerevisiae genome , 2002, Nature.

[9]  H. Akashi,et al.  Gene expression and molecular evolution. , 2001, Current opinion in genetics & development.

[10]  Manoj Bhasin,et al.  Analysis and prediction of affinity of TAP binding peptides using cascade SVM , 2004, Protein science : a publication of the Protein Society.

[11]  D Haussler,et al.  Knowledge-based analysis of microarray gene expression data by using support vector machines. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[12]  M. Gerstein,et al.  Genomic analysis of gene expression relationships in transcriptional regulatory networks. , 2003, Trends in genetics : TIG.

[13]  Gajendra P. S. Raghava,et al.  Correlation between Expression Level of Gene and Codon Usage , 2004 .

[14]  Michael A. Beer,et al.  Predicting Gene Expression from Sequence , 2004, Cell.

[15]  Kuo-Chen Chou,et al.  Predicting subcellular localization of proteins in a hybridization space , 2004, Bioinform..

[16]  M. Gerstein,et al.  Subcellular localization of the yeast proteome. , 2002, Genes & development.

[17]  M. Q. Zhang Large-scale gene expression data analysis: a new challenge to computational biologists. , 1999, Genome research.

[18]  Kuo-Chen Chou,et al.  Prediction and classification of protein subcellular location—sequence‐order effect and pseudo amino acid composition , 2003, Journal of cellular biochemistry.

[19]  Hiroshi Akashi,et al.  Translational selection and yeast proteome evolution. , 2003, Genetics.

[20]  A. Vinogradov Compactness of human housekeeping genes: selection for economy or genomic design? , 2004, Trends in genetics : TIG.

[21]  Michael R. Green,et al.  Dissecting the Regulatory Circuitry of a Eukaryotic Genome , 1998, Cell.

[22]  Alexander E Vinogradov,et al.  Isochores and tissue-specificity. , 2003, Nucleic acids research.

[23]  Mark Gerstein,et al.  Reconstructing genetic networks in yeast , 2003, Nature Biotechnology.

[24]  Shizhong Xu,et al.  Supervised cluster analysis for microarray data based on multivariate Gaussian mixture , 2004, Bioinform..

[25]  Giorgio Bernardi,et al.  Correlations between the compositional properties of human genes, codon usage, and amino acid composition of proteins , 1991, Journal of Molecular Evolution.

[26]  K C Chou,et al.  An analysis of protein folding type prediction by seed-propagated sampling and jackknife test , 1995, Journal of protein chemistry.

[27]  L. Samson,et al.  Global response of Saccharomyces cerevisiae to an alkylating agent. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[28]  K. H. Wolfe,et al.  Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae , 2000, Yeast.

[29]  Kuo-Chen Chou,et al.  Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. , 2003, Biochemical and biophysical research communications.

[30]  Mark Gerstein,et al.  Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models. , 2003, Nucleic acids research.

[31]  Gajendra P S Raghava,et al.  Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition* , 2004, Journal of Biological Chemistry.

[32]  Gajendra P. S. Raghava,et al.  ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST , 2004, Nucleic Acids Res..