Fast and exact quantification of motif occurrences in biological sequences

Background: Identifying motifs and quantifying their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for the count distribution of motifs under Markovian assumptions have high computational complexity and are impractical for large motif sets. Approximate formulae, e.g. those based on the compound Poisson distribution, are faster, but reliable p-value calculation remains challenging. Here, we introduce ‘motif_prob’, a fast implementation of an exact formula for the motif count distribution that uses progressive approximation with arbitrary precision. Our implementation makes the usually impractical exact calculation feasible, and it is poised to replace currently employed heuristics.

Results: We implement motif_prob in both Perl and C++, using an efficient error-bounded iterative process for the exact formula. We compare it with state-of-the-art tools (e.g. MoSDi) in terms of precision and run time, and present a real-world use case on bacterial motif characterization. Our software can process a million motifs (13–31 bases) over genome lengths of 5 million bases within a minute on a regular laptop. The run times of both the Perl and C++ code are 50–1000× shorter than those of MoSDi, and remain 60–120× shorter even when MoSDi uses its fast compound Poisson approximation. In the real-world use cases, we first show that motif_prob is consistent with MoSDi, and then show, using motifs found in antimicrobial resistance genes, that p-value quantification is crucial for measuring enrichment when bacteria have different GC content. The software and source code are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob .

Conclusions: The motif_prob software is a multi-platform, efficient, open-source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery and characterization tools to quantify enrichment and deviation from expected frequency ranges with exact p-values, without loss of data processing efficiency.
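To make the notion of an exact motif count distribution concrete, the following is a minimal sketch of a textbook approach, not the formula or code used by motif_prob. It assumes an i.i.d. background model (the simplest case of the Markovian assumptions mentioned above), counts overlapping occurrences, and uses hypothetical function names (`kmp_automaton`, `count_distribution`). Its cost grows with sequence length × motif length × count cap, which illustrates why naive exact computation is impractical at genome scale.

```python
def kmp_automaton(motif, alphabet):
    """Build a matching automaton: state = length of the motif prefix matched so far."""
    m = len(motif)
    # fail[j] = length of the longest proper border of motif[:j]
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k > 0 and motif[i] != motif[k]:
            k = fail[k]
        if motif[i] == motif[k]:
            k += 1
        fail[i + 1] = k
    delta = []
    for state in range(m + 1):
        row = {}
        for c in alphabet:
            s = fail[state] if state == m else state  # allow overlapping matches
            while s > 0 and motif[s] != c:
                s = fail[s]
            if motif[s] == c:
                s += 1
            row[c] = s
        delta.append(row)
    return delta


def count_distribution(motif, n, probs, max_count):
    """Exact distribution of the number of (overlapping) motif occurrences
    in an i.i.d. random sequence of length n.

    probs maps each letter to its background probability (assumed model).
    Returns a list d where d[k] = P(count == k) for k < max_count and
    d[max_count] = P(count >= max_count).
    """
    m = len(motif)
    delta = kmp_automaton(motif, list(probs))
    # dist[s][k] = P(automaton in state s with k occurrences seen so far)
    dist = [[0.0] * (max_count + 1) for _ in range(m + 1)]
    dist[0][0] = 1.0
    for _ in range(n):
        new = [[0.0] * (max_count + 1) for _ in range(m + 1)]
        for s in range(m + 1):
            for k in range(max_count + 1):
                p = dist[s][k]
                if p == 0.0:
                    continue
                for c, pc in probs.items():
                    t = delta[s][c]
                    # reaching state m means one more occurrence; cap at max_count
                    k2 = min(k + (1 if t == m else 0), max_count)
                    new[t][k2] += p * pc
        dist = new
    return [sum(dist[s][k] for s in range(m + 1)) for k in range(max_count + 1)]


if __name__ == "__main__":
    # Illustrative background frequencies for a GC-poor genome (made-up values).
    background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
    observed = 2
    d = count_distribution("ACA", 50, background, observed)
    # Upper-tail p-value: probability of seeing at least `observed` occurrences.
    print("P(count >= %d) = %.6g" % (observed, d[observed]))
```

Setting `max_count` to the observed count makes the last entry of the returned list the enrichment p-value directly, and changing the background frequencies shows how the same observed count can be more or less surprising depending on GC content.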
