Network motif-based identification of transcription factor-target gene relationships by integrating multi-source biological data

Background
Integrating data from multiple global assays and curated databases is essential to understanding the spatio-temporal interactions within cells. Different experiments measure cellular processes at varying breadth and depth, while databases contain biological information based on established facts or published data. Integrating these complementary datasets helps infer a mutually consistent transcriptional regulatory network (TRN) with strong similarity to the structure of the underlying genetic regulatory modules. Decomposing the TRN into a small set of recurring regulatory patterns, called network motifs (NMs), facilitates the inference. Identifying NMs defined by specific transcription factors (TFs) establishes the framework structure of a TRN and allows the inference of TF-target gene relationships. This paper introduces a computational framework that uses data from multiple sources to infer TF-target gene relationships on the basis of NMs. The data include time course gene expression profiles, genome-wide location analysis data, binding sequence data, and gene ontology (GO) information.

Results
The proposed computational framework was tested using gene expression data associated with cell cycle progression in yeast. Among 800 cell cycle related genes, 85 were identified as candidate TFs and classified into four previously defined NMs. The NMs for a subset of TFs were obtained from the literature. Support vector machine (SVM) classifiers were used to estimate NMs for the remaining TFs. The potential downstream target genes for the TFs were clustered into 34 biologically significant groups. The relationships between TFs and potential target gene clusters were examined by training recurrent neural networks whose topologies mimic the NMs to which the TFs are classified.
The identified relationships between TFs and gene clusters were evaluated using the following biological validation and statistical analyses: (1) gene set enrichment analysis (GSEA) to evaluate the clustering results; (2) leave-one-out cross-validation (LOOCV) to ensure that the SVM classifiers assign TFs to NM categories with high confidence; (3) binding site enrichment analysis (BSEA) to determine enrichment of the gene clusters for the cognate binding sites of their predicted TFs; and (4) comparison with previously reported results in the literature to confirm the inferred regulations.

Conclusion
The major contribution of this study is the development of a computational framework that assists the inference of a TRN by integrating heterogeneous data from multiple sources and by decomposing the TRN into NM-based modules. The inference capability of the proposed framework is verified statistically (e.g., LOOCV) and biologically (e.g., GSEA, BSEA, and literature validation). The proposed framework is useful for inferring small NM-based modules of TF-target gene relationships that can serve as a basis for generating new testable hypotheses.
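
To illustrate the kind of calculation behind a binding site enrichment analysis such as the BSEA step named in the Results, the following minimal Python sketch computes a one-sided hypergeometric tail probability. The abstract does not state which statistical test the authors used, so the choice of the hypergeometric test is an assumption, and the function name `hypergeom_enrichment_p` and all numbers except the 800-gene population are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, cluster_size, hits):
    """Return P(X >= hits) for a hypergeometric draw.

    population   -- total number of genes considered (e.g. 800)
    annotated    -- genes in the population carrying the TF binding site
    cluster_size -- number of genes in the cluster being tested
    hits         -- genes in the cluster that carry the binding site

    Assumption: enrichment is scored with a one-sided hypergeometric
    test; the source abstract does not specify the test.
    """
    if not 0 <= annotated <= population:
        raise ValueError("annotated must lie in [0, population]")
    if not 0 <= cluster_size <= population:
        raise ValueError("cluster_size must lie in [0, population]")

    max_hits = min(annotated, cluster_size)
    if hits > max_hits:
        return 0.0  # more hits than is possible

    start = max(hits, 0)
    total = comb(population, cluster_size)
    # Sum the probability mass of every outcome at least as extreme.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(start, max_hits + 1)
    )
    return tail / total


# Hypothetical example: 40 of 800 genes carry the site, and a
# 20-gene cluster contains 5 of them.
p_value = hypergeom_enrichment_p(800, 40, 20, 5)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value would suggest that the cluster contains more binding-site-bearing genes than expected by chance; with 34 clusters and many TFs, a multiple-testing correction would normally be applied on top of this.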
