PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets

Identifying conserved patterns in DNA sequences, known as motif discovery, is an important and challenging computational task. High-throughput sequencing data sets, which contain hundreds of sequences or more, help improve the accuracy of motif discovery but demand considerably higher computing performance. To identify motifs efficiently in large DNA data sets, a new algorithm called PairMotifChIP is proposed, which extracts and combines pairs of l-mers in the input that have a relatively small Hamming distance. In particular, a method for rapidly extracting such pairs of l-mers is designed; it can be used not only in PairMotifChIP but also in other DNA data mining tasks with the same requirement. Experimental results on simulated data show that the proposed algorithm finds motifs successfully and runs faster than state-of-the-art motif discovery algorithms. The validity of the proposed algorithm has also been verified on real data.
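
To make the notion of "pairs of l-mers with relatively small Hamming distance" concrete, the following minimal Python sketch enumerates such pairs by brute force. It is not the rapid extraction method designed for PairMotifChIP, which the abstract does not detail; it is only a naive baseline. The function names, the `max_dist` threshold, and the restriction to pairs drawn from different sequences are assumptions made for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def lmers(seq: str, l: int):
    """Yield (offset, l-mer) for every length-l window in seq."""
    for i in range(len(seq) - l + 1):
        yield i, seq[i:i + l]


def close_pairs(seqs, l, max_dist):
    """Naively list pairs of l-mers taken from two different sequences
    whose Hamming distance is at most max_dist (hypothetical threshold).

    Each result is ((seq_index, offset, lmer), (seq_index, offset, lmer), distance).
    """
    pairs = []
    # Assumption: only l-mers from different sequences are paired.
    for (si, s), (ti, t) in combinations(enumerate(seqs), 2):
        for i, x in lmers(s, l):
            for j, y in lmers(t, l):
                d = hamming(x, y)
                if d <= max_dist:
                    pairs.append(((si, i, x), (ti, j, y), d))
    return pairs


if __name__ == "__main__":
    for pair in close_pairs(["ACGTAC", "TTACGA"], l=4, max_dist=1):
        print(pair)
```

This brute-force version compares every l-mer with every l-mer in each other sequence, so its cost grows quadratically with the total input length. That cost is what motivates a faster pair-extraction method on data sets with hundreds of sequences or more.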
