Classification of co-expressed genes from DNA regulatory regions

The analysis of non-coding DNA regulatory regions is one of the most challenging open problems in computational biology. In this paper we investigate whether we can predict functional information about genes by using information extracted from their sequences together with expression data. We formalize this problem as a classification problem, and we apply Support Vector Machines (SVMs) with non-linear kernels to predict classes of co-expressed genes obtained from clustering procedures. SVMs are trained using information about selected motifs extracted from DNA regulatory regions through combinatorial and statistical methods. In our experiments, we show that functional classes of genes can be predicted from biological sequence data in Saccharomices cerevisiae, achieving results competitive with those recently presented in the literature.

[1]  P. Brown,et al.  Exploring the metabolic and genetic control of gene expression on a genomic scale. , 1997, Science.

[2]  Michael Q. Zhang,et al.  DNA motifs in human and mouse proximal promoters predict tissue-specific expression. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[3]  Roded Sharan,et al.  A motif-based framework for recognizing sequence families , 2005, ISMB.

[4]  Caroline Smith,et al.  Establishing glucose- and ABA-regulated transcription networks in Arabidopsis by microarray analysis and promoter classification using a Relevance Vector Machine. , 2006, Genome research.

[5]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[6]  Michael A. Beer,et al.  Predicting Gene Expression from Sequence , 2004, Cell.

[7]  J. Mercer Functions of positive and negative type, and their connection with the theory of integral equations , 1909 .

[8]  Michael Q. Zhang,et al.  SCPD: a promoter database of the yeast Saccharomyces cerevisiae , 1999, Bioinform..

[9]  J. Mercer Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations , 1909 .

[10]  Michael Ruogu Zhang,et al.  Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. , 1998, Molecular biology of the cell.

[11]  Pavel Paclík,et al.  Adaptive floating search methods in feature selection , 1999, Pattern Recognit. Lett..

[12]  D. Botstein,et al.  Genomic expression programs in the response of yeast cells to environmental changes. , 2000, Molecular biology of the cell.

[13]  S. C. Lakhotia,et al.  What is a gene? , 1997 .

[14]  M. Eisen,et al.  Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering , 2002, Genome Biology.

[15]  Graziano Pesole,et al.  In silico representation and discovery of transcription factor binding sites , 2004, Briefings Bioinform..

[16]  Helen Pearson,et al.  Genetics: What is a gene? , 2006, Nature.

[17]  Nan Li,et al.  Analysis of computational approaches for motif discovery , 2006, Algorithms for Molecular Biology.

[18]  Graziano Pesole,et al.  Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes , 2004, Nucleic Acids Res..

[19]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[20]  Geoffrey J McLachlan,et al.  Selection bias in gene extraction on the basis of microarray gene-expression data , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[21]  William Stafford Noble,et al.  Assessing computational tools for the discovery of transcription factor binding sites , 2005, Nature Biotechnology.

[22]  Michael Q. Zhang,et al.  Identifying tissue-selective transcription factor binding sites in vertebrate promoters. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[23]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.