Binding site discovery from nucleic acid sequences by discriminative learning of hidden Markov models

We present a discriminative learning method, based on hidden Markov models, for discovering binding site patterns in nucleic acid sequences. Sets of positive and negative example sequences are mined for sequence motifs whose occurrence frequency differs between the sets. The method offers several objective functions, but we concentrate on the mutual information of condition and motif occurrence. We systematically compare our method with numerous published motif-finding tools. Our method achieves the highest motif discovery performance while being faster than most published methods. We present case studies of data from various technologies, including ChIP-Seq, RIP-Chip and PAR-CLIP, of embryonic stem cell transcription factors and of RNA-binding proteins, demonstrating the practicality and utility of the method. For the alternative splicing factor RBM10, our analysis finds motifs known to be splicing-relevant. The motif discovery method is implemented in the free software package Discrover. It is applicable to genome- and transcriptome-scale data, makes use of available repeat experiments, and can handle more complex data configurations in addition to binary contrasts.
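To make the objective named above concrete, the following is a minimal sketch of the mutual information between condition (positive or negative set) and motif occurrence, computed from a contingency table of sequence counts. It is an illustration only and not the Discrover implementation: the function names `mutual_information` and `occurrence_table` are hypothetical, and motif occurrence is approximated here by an exact substring match rather than by a hidden Markov model.

```python
import math


def mutual_information(counts):
    """Mutual information in bits of a contingency table of counts.

    counts[i][j] is the number of sequences in condition i
    (e.g. 0 = positive set, 1 = negative set) with motif status j
    (0 = motif present, 1 = motif absent).
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n == 0:
                continue  # a zero cell contributes nothing to the sum
            # p(i,j) * log2( p(i,j) / (p(i) * p(j)) )
            mi += (n / total) * math.log2(n * total / (row_sums[i] * col_sums[j]))
    return mi


def occurrence_table(positives, negatives, word):
    """Build the 2x2 table using exact substring matching.

    Assumption for illustration: a real discriminative HMM would score
    motif occurrence probabilistically instead of matching a fixed word.
    """
    table = []
    for seqs in (positives, negatives):
        present = sum(1 for s in seqs if word in s)
        table.append([present, len(seqs) - present])
    return table


if __name__ == "__main__":
    pos = ["ACGTGTAAATA", "TTTGTAAATAC", "GGTGTAAATAA", "CCCCCCCC"]
    neg = ["ACGTACGTACG", "TTTTTTTTTTT", "GGGGCCCCAAA", "TGTAAATAGGG"]
    table = occurrence_table(pos, neg, "TGTAAATA")
    print(table, mutual_information(table))
```

A motif that occurs equally often in both sets yields a mutual information of zero, whereas a motif that perfectly separates two equally sized sets yields one bit, so maximizing this quantity favors motifs whose occurrence is informative about the condition.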
