FastMotif: spectral sequence motif discovery

MOTIVATION Sequence discovery tools play a central role in several fields of computational biology. In transcription factor binding studies, most existing motif finding algorithms are computationally demanding and may not scale to the increasingly large datasets produced by modern high-throughput sequencing technologies.

RESULTS We present FastMotif, a new motif discovery algorithm built on a recent machine learning technique referred to as the Method of Moments. Because it is based on spectral decompositions, our method is robust to model misspecification and is not prone to locally optimal solutions. The resulting algorithm is extremely fast and is designed for the analysis of big sequencing data. On HT-SELEX data, FastMotif extracts motif profiles that match those computed by various state-of-the-art algorithms while running one order of magnitude faster. We provide a theoretical and numerical analysis of the algorithm's robustness and discuss its sensitivity to the free parameters.

AVAILABILITY AND IMPLEMENTATION The Matlab code of FastMotif is available from http://lcsb-portal.uni.lu/bioinformatics.

CONTACT vlassis@adobe.com

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
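As a rough illustration of the moment-based idea mentioned in the abstract, the sketch below computes low-order empirical moments of one-hot encoded DNA sequences. It is not the FastMotif algorithm, and it is written in Python rather than the authors' Matlab. The function names (`one_hot`, `first_moment`, `second_moment`) and the toy sequences are hypothetical, and the sequences are assumed to be already aligned and of equal length. The first moment is a per-position nucleotide frequency table, which is the general form a motif profile takes; moment-based methods in general estimate model parameters from statistics of this kind rather than by iterative likelihood maximization.

```python
import numpy as np

ALPHABET = "ACGT"  # column order used for all matrices below


def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) 0/1 matrix."""
    idx = [ALPHABET.index(base) for base in seq]
    out = np.zeros((len(seq), 4))
    out[np.arange(len(seq)), idx] = 1.0
    return out


def first_moment(seqs):
    """Average one-hot encoding: entry (i, b) is the frequency of base b at position i."""
    return np.mean([one_hot(s) for s in seqs], axis=0)


def second_moment(seqs, i, j):
    """4x4 empirical co-occurrence matrix of the bases at positions i and j."""
    enc = np.stack([one_hot(s) for s in seqs])
    return enc[:, i, :].T @ enc[:, j, :] / len(seqs)


# Toy, hypothetical input: four aligned sequences of length 4.
toy = ["ACGT", "ACGA", "TCGT", "ACCT"]
profile = first_moment(toy)       # rows are positions, columns are A, C, G, T
pair = second_moment(toy, 0, 3)   # joint frequencies of positions 0 and 3
print(profile)
print(pair)
```

Each row of `profile` sums to one, and `pair` sums to one over all 16 entries. A spectral method would go on to decompose moments like these; that step is left out here because the abstract does not specify it.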
