Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison

Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, ‘enhancers’), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant transcription factors (TFs) and/or their binding specificities are unknown. Here, we provide a significantly improved method for ‘motif-blind’, genome-wide CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to ‘supervise’ the search. The method is a new statistical approach based on ‘Interpolated Markov Models’: it captures the statistical profile of variable-length words in the known CRMs of a regulatory network and finds candidate CRMs that match this profile. It also uses orthologs of the known CRMs from closely related genomes. We evaluate the predicted CRMs in silico by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the 12 tested CRMs were found to be functional, and 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code and as a plugin for a genome browser installed on our servers.
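
To make the idea of matching a variable-length word profile concrete, the following is a minimal sketch of an interpolated Markov model scorer. It is not the authors' implementation: the class and function names, the count-based interpolation weight `n / (n + trust)`, the `trust` constant, and the toy sequences are all assumptions chosen for illustration, and the cross-species component is omitted. Each base's probability blends estimates from contexts of increasing length, trusting a longer context more when it was seen more often in training; a window is then scored by its log-likelihood ratio under a model trained on known CRMs versus a background model.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


class InterpolatedMarkovModel:
    """Toy IMM: blends Markov models of order 0..max_order (illustrative only)."""

    def __init__(self, max_order=3, trust=10.0):
        self.max_order = max_order
        self.trust = trust  # hypothetical constant controlling the blend weights
        # counts[k][context][base] = times `base` followed the length-k `context`
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def train(self, sequences):
        for seq in sequences:
            for i, base in enumerate(seq):
                for k in range(min(i, self.max_order) + 1):
                    self.counts[k][seq[i - k:i]][base] += 1
        return self

    def prob(self, base, context):
        # Order 0: base frequencies with add-one smoothing, so p is never zero.
        c0 = self.counts[0].get("", {})
        p = (c0.get(base, 0) + 1) / (sum(c0.values()) + len(ALPHABET))
        # Blend in longer contexts, weighting each by how often it was observed.
        for k in range(1, min(len(context), self.max_order) + 1):
            table = self.counts[k].get(context[len(context) - k:])
            if not table:
                continue  # context never seen: keep the lower-order estimate
            n = sum(table.values())
            lam = n / (n + self.trust)
            p = lam * table.get(base, 0) / n + (1 - lam) * p
        return p


def log_likelihood(model, seq):
    return sum(math.log(model.prob(seq[i], seq[max(0, i - model.max_order):i]))
               for i in range(len(seq)))


def score_window(crm_model, bg_model, window):
    """Log-likelihood ratio: positive means the window looks more CRM-like."""
    return log_likelihood(crm_model, window) - log_likelihood(bg_model, window)


# Toy usage with made-up training sequences (not real CRMs).
crm_model = InterpolatedMarkovModel().train(["ACGT" * 10])
bg_model = InterpolatedMarkovModel().train(["TTTT" * 10])
print(score_window(crm_model, bg_model, "ACGTACGT"))
print(score_window(crm_model, bg_model, "TTTTTTTT"))
```

In a genome-wide scan, a scorer of this kind would be applied to sliding windows and the highest-scoring windows reported as candidate CRMs. The simple count-based weight used here stands in for whatever weighting scheme the actual method uses.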
