DiNAMO: highly sensitive DNA motif discovery in high-throughput sequencing data

Background: Discovering over-represented approximate motifs in DNA sequences is an essential part of bioinformatics. This topic has been studied extensively because of the increasing number of potential applications. However, it remains a difficult challenge, especially given the huge quantity of data generated by high-throughput sequencing technologies. To cope with this volume, existing tools use greedy algorithms and probabilistic approaches to find motifs in reasonable time. Nevertheless, these approaches lack sensitivity and have difficulty coping with rare and subtle motifs.

Results: We developed DiNAMO (for DNA MOtif), a new software tool based on an exhaustive and efficient algorithm for IUPAC motif discovery. We evaluated DiNAMO on synthetic and real datasets in two different applications, namely ChIP-seq peak analysis and Systematic Sequencing Error analysis. DiNAMO compares favorably with existing methods and is robust to noise.

Conclusions: We show that DiNAMO can serve as a tool to search for degenerate motifs in an exact manner using IUPAC models. DiNAMO can be used in scanning mode with sliding windows or in fixed position mode, which makes it suitable for numerous potential applications.

Availability: https://github.com/bonsai-team/DiNAMO.
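To make the notion of a degenerate IUPAC motif concrete, here is a minimal Python sketch of how such a motif can be matched against a sequence, either by sliding a window over every position or by checking a single fixed position. It is an illustration only and not DiNAMO's algorithm or interface: the function names `matches`, `scan`, and `matches_at`, the example motif `WGATAR`, and the example sequence are all assumptions made for this sketch.

```python
# Standard IUPAC nucleotide codes: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(motif, kmer):
    """Return True if every base of kmer is allowed by the motif symbol at that position."""
    return len(motif) == len(kmer) and all(
        base in IUPAC[symbol] for symbol, base in zip(motif, kmer)
    )


def scan(sequence, motif):
    """Sliding-window search (hypothetical helper): 0-based starts of all matches."""
    k = len(motif)
    return [
        i for i in range(len(sequence) - k + 1)
        if matches(motif, sequence[i:i + k])
    ]


def matches_at(sequence, motif, start):
    """Fixed-position check (hypothetical helper): does the motif occur exactly at start?"""
    return matches(motif, sequence[start:start + len(motif)])


# Illustrative data, not taken from the paper.
sequence = "CCAGATAAGGTGATAGC"
hits = scan(sequence, "WGATAR")
```

In this sketch `W` accepts A or T and `R` accepts A or G, so a single six-letter motif covers four concrete hexamers. A discovery tool additionally has to enumerate candidate motifs and score their over-representation against a control set, which this example does not attempt.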
