A comparative benchmark of classic DNA motif discovery tools on synthetic data

Hundreds of human proteins have been found to establish transient interactions with rather degenerate consensus DNA sequences, or motifs. Identifying these motifs and the genomic sites where the interactions occur represents one of the most challenging research goals in modern molecular biology and bioinformatics. The last twenty years have witnessed an explosion of computational tools designed to perform this task, whose performance was last compared fifteen years ago. Here, we survey sixteen of them, benchmark their ability to identify known motifs nested in twenty-nine simulated sequence datasets, and report their strengths, weaknesses, and complementarity.
