Prediction of regulatory networks: genome-wide identification of transcription factor targets from gene expression data

MOTIVATION: Defining regulatory networks, which link transcription factors (TFs) to their targets, is a central problem in post-genomic biology. One might expect these networks to be readily determined by inspecting gene expression data. However, the relationship between the expression timecourse of a transcription factor and that of its target is not obvious (e.g. it is not a simple correlation over the timecourse), and current analysis methods, such as hierarchical clustering, have not been very successful in deciphering these relationships.

RESULTS: Here we introduce an approach based on support vector machines (SVMs) that predicts the targets of a transcription factor by identifying subtle relationships between their expression profiles. In particular, we used SVMs to predict the regulatory targets of 36 transcription factors in the Saccharomyces cerevisiae genome from microarray expression data covering many different physiological conditions. We trained and tested our SVM on a data set constructed to include a significant number of both positive and negative examples, directly addressing data imbalance issues. This was non-trivial, given that most of the known experimental information covers only positives. Overall, 63% of our TF-target relationships were confirmed through cross-validation. We further assessed our regulatory network identifications by comparing them with the results of two recent genome-wide ChIP-chip experiments. The agreement between our results and these experiments is comparable to the agreement (albeit low) between the two experiments themselves. We also find that the network has a delocalized structure with respect to chromosomal positioning: the targets of a given transcription factor are spread fairly uniformly across the genome.

AVAILABILITY: The overall network of the relationships is available on the web at http://bioinfo.mbb.yale.edu/expression/echipchip
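The training setup summarized in the abstract (a balanced set of positive and negative TF-target pairs, an SVM, and cross-validation) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the pair encoding (element-wise product of the two profiles plus a constant bias feature), the Pegasos-style linear SVM trainer, the undersampling strategy, the synthetic data, and all function names.

```python
import random

def pair_features(tf_profile, gene_profile):
    """Hypothetical pair encoding: element-wise product of the two
    expression profiles, plus a constant 1.0 acting as a bias feature."""
    return [a * b for a, b in zip(tf_profile, gene_profile)] + [1.0]

def balance(positives, negatives, rng):
    """Undersample the larger class so both classes have equal size."""
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)

def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style stochastic subgradient descent on the regularized
    hinge loss. Labels must be +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1.0:  # example violates the margin
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def predict(w, x):
    """Return +1 or -1 according to the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0.0 else -1

def cross_validate(X, y, k=5, seed=0):
    """k-fold cross-validation; returns the fraction of held-out pairs
    whose label is predicted correctly."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        w = train_linear_svm([X[i] for i in train], [y[i] for i in train], seed=seed)
        correct += sum(predict(w, X[i]) == y[i] for i in held_out)
    return correct / len(X)

def demo(seed=0, n_genes=40, n_conditions=12):
    """Synthetic data only: 'targets' follow the TF profile with noise,
    'non-targets' are independent random profiles."""
    rng = random.Random(seed)
    tf = [rng.gauss(0, 1) for _ in range(n_conditions)]
    positives = [[v + rng.gauss(0, 0.3) for v in tf] for _ in range(n_genes)]
    negatives = [[rng.gauss(0, 1) for _ in range(n_conditions)]
                 for _ in range(3 * n_genes)]
    pos, neg = balance(positives, negatives, rng)
    X = [pair_features(tf, g) for g in pos + neg]
    y = [1] * len(pos) + [-1] * len(neg)
    return cross_validate(X, y, k=5, seed=seed)
```

Calling `demo()` returns the cross-validated accuracy on the synthetic pairs. The product encoding only captures same-time co-variation, so it would miss the time-shifted or inverted relationships that the abstract suggests matter; a real analysis would likely need a richer pair representation or a non-linear kernel.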
