Learning regulatory motifs by direct optimization of Fisher Exact Test Score

Built upon the hypergeometric distribution, the Fisher Exact Test score (FETS) and its variants offer a natural way to quantify the enrichment of transcription factor binding site (TFBS) motifs, and they serve as the objective functions of several widely used discriminative motif discovery methods, such as HOMER and DREME. Despite its popularity and efficacy, FETS is non-smooth and non-differentiable, and is thus difficult to optimize numerically. To circumvent this limitation, existing tools that optimize FETS rely either on discrete search strategies or on indirect tuning of a few external parameters, which can hurt accuracy and leave information in the input sequences unused. In this paper, we propose DirectFS, which is (to the best of our knowledge) the first FETS-based approach that allows direct learning of the motif parameters in continuous space. We show that when the resulting loss function is optimized in a coordinate-wise manner, the cost function of each sub-problem is piece-wise constant, so its optimum can be found exactly and efficiently. Further, a key step in each iteration of DirectFS requires finding the most statistically significant among tens of thousands of Fisher's exact tests, which is solved efficiently using a novel ‘lookahead’-style algorithm. Experimental evaluations on ENCODE ChIP-seq data illustrate the performance of the proposed approach.
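
To make the piece-wise constant property concrete, the following is a minimal sketch and not the DirectFS algorithm itself. It assumes a simplified one-dimensional sub-problem: each sequence already has a motif score, and the only free parameter is a score threshold. The function names `fisher_tail` and `best_threshold` and the toy scores are hypothetical. Because the contingency table changes only when the threshold crosses an observed score, the one-sided Fisher p-value is constant between consecutive scores, and scanning the distinct scores finds the exact optimum.

```python
from math import comb

def fisher_tail(k, K, n, N):
    """One-sided Fisher p-value: P(X >= k) for X ~ Hypergeometric.

    N = total sequences, K = positive sequences,
    n = sequences containing the motif, k = positives containing it.
    """
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

def best_threshold(pos_scores, neg_scores):
    """Exact minimizer of the p-value over all score thresholds.

    The p-value only changes at observed scores, so it is enough
    to evaluate it once per distinct score (assumed toy setting).
    """
    K = len(pos_scores)
    N = K + len(neg_scores)
    best_p, best_t = 1.0, None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        k = sum(s >= t for s in pos_scores)      # positives called as hits
        n = k + sum(s >= t for s in neg_scores)  # all sequences called as hits
        p = fisher_tail(k, K, n, N)
        if p < best_p:
            best_p, best_t = p, t
    return best_p, best_t

# Toy data (made up for illustration): 4 positive and 4 negative sequences.
pos = [0.9, 0.8, 0.7, 0.2]
neg = [0.6, 0.3, 0.1, 0.05]
p_value, threshold = best_threshold(pos, neg)
```

This brute-force scan recomputes the counts for every candidate threshold, which is adequate for a toy example; the efficient ‘lookahead’-style search described in the abstract is not reproduced here.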
