Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks

Background: The learning of global genetic regulatory networks from expression data is a severely under-constrained problem that is aided by reducing the dimensionality of the search space by means of clustering genes into putatively co-regulated groups, as opposed to those that are simply co-expressed. Because genes may be co-regulated only across a subset of all observed experimental conditions, biclustering (clustering of genes and conditions) is more appropriate than standard clustering. Co-regulated genes are also often functionally (physically, spatially, genetically, and/or evolutionarily) associated, and such a priori known or pre-computed associations can provide support for appropriately grouping genes. One important association is the presence of one or more common cis-regulatory motifs. In organisms where these motifs are not known, their de novo detection, integrated into the clustering algorithm, can help to guide the process towards more biologically parsimonious solutions.

Results: We have developed an algorithm, cMonkey, that detects putative co-regulated gene groupings by integrating the biclustering of gene expression data and various functional associations with the de novo detection of sequence motifs.

Conclusion: We have applied this procedure to the archaeon Halobacterium NRC-1, as part of our efforts to decipher its regulatory network. In addition, we used cMonkey on public data for three organisms in the other two domains of life: Helicobacter pylori, Saccharomyces cerevisiae, and Escherichia coli. The biclusters detected by cMonkey both recapitulated known biology and enabled novel predictions (some for Halobacterium were subsequently confirmed in the laboratory). For example, it identified the bacteriorhodopsin regulon, assigned additional genes to this regulon with apparently unrelated function, and detected its known promoter motif. We have performed a thorough comparison of cMonkey results against other clustering methods, and find that cMonkey biclusters are more parsimonious with all available evidence for co-regulation.
