Identification of Predictive Cis-Regulatory Elements Using a Discriminative Objective Function and a Dynamic Search Space

The generation of genomic binding and accessibility data by massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for identifying DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs rather than merely over-represented sequence elements. Its key distinguishing feature is that it combines a dynamic search space and a learned threshold for finding discriminative motifs with the modeling of motifs as full position weight matrices (PWMs) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than, motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs.
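To make the idea of a PWM paired with a learned threshold concrete, the following is a minimal sketch and not the MotifSpec implementation. It assumes a forward-strand scan, a uniform background, and balanced accuracy as a stand-in discriminative objective (the abstract does not specify the actual objective function). All function names and the toy sequences are hypothetical.

```python
import math

BASES = "ACGT"

def pwm_log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log-odds PWM (uniform background assumed)."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({b: math.log2(((column[b] + pseudocount) / total) / background)
                    for b in BASES})
    return pwm

def best_site_score(pwm, seq):
    """Highest PWM score over all windows of a sequence (forward strand only)."""
    width = len(pwm)
    scores = [sum(pwm[i][seq[start + i]] for i in range(width))
              for start in range(len(seq) - width + 1)]
    return max(scores) if scores else float("-inf")

def learn_threshold(pwm, positives, negatives):
    """Pick the score cutoff that maximizes balanced accuracy (an assumed objective)."""
    pos = [best_site_score(pwm, s) for s in positives]
    neg = [best_site_score(pwm, s) for s in negatives]
    best_acc, best_t = float("-inf"), None
    for t in sorted(set(pos + neg)):
        tpr = sum(x >= t for x in pos) / len(pos)   # bound sequences called bound
        tnr = sum(x < t for x in neg) / len(neg)    # unbound sequences called unbound
        acc = (tpr + tnr) / 2
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_t, best_acc

# Toy data (hypothetical): a 4-bp motif "TGAC" observed 10 times at each position.
counts = [{b: (10 if b == m else 0) for b in BASES} for m in "TGAC"]
pwm = pwm_log_odds(counts)
positives = ["AATGACTT", "CCTGACGG", "GTGACAAA"]   # each contains TGAC
negatives = ["AAAAAAAA", "CCGGCCGG", "ATATATAT"]   # none contains TGAC
threshold, accuracy = learn_threshold(pwm, positives, negatives)
```

In this toy setting the learned cutoff separates the two sets perfectly. On real ChIP-seq data the threshold would need to be chosen on training sequences and evaluated on held-out sequences to measure predictive power.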
