Direct AUC optimization of regulatory motifs

Motivation: The discovery of transcription factor binding site (TFBS) motifs is essential for untangling the complex mechanism of genetic variation under different developmental and environmental conditions. Among the many computational approaches for de novo identification of TFBS motifs, discriminative motif learning (DML) methods have proven promising for harnessing the large and growing body of high‐throughput binding data. However, they sacrifice accuracy for speed and can fail to fully utilize the information in the input sequences.

Results: We propose a novel algorithm called CDAUC for optimizing DML‐learned motifs based on the area under the receiver‐operating characteristic curve (AUC) criterion, which has been widely used in the literature to evaluate the significance of extracted motifs. We show that when the considered AUC loss function is optimized in a coordinate‐wise manner, the cost function of each resultant sub‐problem is a piece‐wise constant function, whose optimal value can be found exactly and efficiently. Further, a key step of each iteration of CDAUC can be solved efficiently as a computational geometry problem. Experimental results on real‐world high‐throughput datasets show that CDAUC outperforms competing methods for refining DML motifs while being one order of magnitude faster. Preliminary results also suggest that CDAUC may help improve the interpretability of convolutional kernels generated by emerging deep learning approaches for predicting TF sequence specificities.

Availability and Implementation: CDAUC is available at: https://drive.google.com/drive/folders/0BxOW5MtIZbJjNFpCeHlBVWJHeW8.

Contact: dshuang@tongji.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.
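The coordinate‐wise observation in the Results can be illustrated with a small sketch. It is not the CDAUC implementation: it assumes a plain linear scorer `w · x` over fixed feature vectors rather than a motif model scanned along sequences, and the function names (`auc`, `best_coordinate_value`, `coordinate_descent_auc`) are hypothetical. With all other weights fixed, each positive/negative pair flips its ranking at only one value of the chosen weight, so the empirical AUC is piece‐wise constant in that weight and can be maximized exactly by testing one point per interval between breakpoints.

```python
import numpy as np


def auc(pos_scores, neg_scores):
    """Empirical AUC: share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    gap = pos_scores[:, None] - neg_scores[None, :]
    return float(np.mean((gap > 0) + 0.5 * (gap == 0)))


def best_coordinate_value(X_pos, X_neg, w, j):
    """Exactly maximize AUC over w[j] with all other weights held fixed (linear scorer assumed)."""
    w_rest = w.copy()
    w_rest[j] = 0.0
    # Pairwise score gap is base + t * slope, where t is the candidate value of w[j].
    base = (X_pos @ w_rest)[:, None] - (X_neg @ w_rest)[None, :]
    slope = X_pos[:, j][:, None] - X_neg[:, j][None, :]
    movable = slope != 0
    # A pair changes its ranking only where its gap crosses zero.
    breakpoints = np.unique(-base[movable] / slope[movable])

    best_t, best_auc = w[j], auc(X_pos @ w, X_neg @ w)
    if breakpoints.size == 0:
        return best_t, best_auc

    # AUC is constant between consecutive breakpoints, so one probe per interval suffices.
    candidates = np.concatenate((
        [breakpoints[0] - 1.0],
        (breakpoints[:-1] + breakpoints[1:]) / 2.0,
        [breakpoints[-1] + 1.0],
    ))
    for t in candidates:
        w_try = w_rest.copy()
        w_try[j] = t
        value = auc(X_pos @ w_try, X_neg @ w_try)
        if value > best_auc:  # keep the current weight unless strictly better
            best_t, best_auc = t, value
    return best_t, best_auc


def coordinate_descent_auc(X_pos, X_neg, w, sweeps=3):
    """Cycle through coordinates; the AUC never decreases from one update to the next."""
    w = np.asarray(w, dtype=float).copy()
    best = auc(X_pos @ w, X_neg @ w)
    for _ in range(sweeps):
        for j in range(w.size):
            w[j], best = best_coordinate_value(X_pos, X_neg, w, j)
    return w, best
```

This sketch re-evaluates the AUC from scratch at every candidate, which is quadratic in the number of pairs per coordinate. The abstract states that CDAUC solves the corresponding step efficiently as a computational geometry problem; that faster procedure is not reproduced here.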
