Parallel Position Weight Matrices Algorithms

Position Weight Matrices (PWMs) are widely used in computational biology. The basic problem, SCAN, is to find the occurrences of a given PWM in large sequences. Several other PWM tasks share a common NP-hard subproblem, SCOREDISTRIBUTION. Existing algorithms rely on enumerating a large set of scores or words, and most of them are not suitable for parallelization. We propose a new algorithm, BUCKETSCOREDISTRIBUTION, that is both very efficient and suitable for parallelization, and we bound the error it induces. We implemented a GPU prototype of SCAN and BUCKETSCOREDISTRIBUTION with the CUDA libraries, and report speedups of 21x and 77x for the different problems on an NVIDIA GTX 280.
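To make the two problems concrete, here is a minimal sequential sketch in Python. It is an illustration under stated assumptions, not the algorithms of the paper: `scan` is the naive sliding-window search, and `bucketed_score_distribution` is a generic discretization that rounds every column score to a multiple of a hypothetical `bucket_width` before accumulating probabilities column by column. The function names, the dictionary-based PWM layout, the background model, and the DNA alphabet are all assumptions made for the example; the actual BUCKETSCOREDISTRIBUTION algorithm and its error bound are not reproduced here.

```python
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: DNA sequences


def scan(pwm, sequence, threshold):
    """Naive SCAN: return (position, score) for each window scoring >= threshold.

    pwm is a list of columns; each column maps a letter to its score.
    """
    m = len(pwm)
    hits = []
    for i in range(len(sequence) - m + 1):
        # Score of the window starting at i is the sum of per-column scores.
        score = sum(pwm[j][sequence[i + j]] for j in range(m))
        if score >= threshold:
            hits.append((i, score))
    return hits


def bucketed_score_distribution(pwm, background, bucket_width):
    """Approximate score distribution under an i.i.d. background model.

    Each column score is rounded to the nearest multiple of bucket_width
    (a generic discretization, not the paper's algorithm). The result maps
    a bucket index b to the probability that a random word of length
    len(pwm) has rounded score b * bucket_width.
    """
    dist = {0: 1.0}  # before any column, all mass sits on score 0
    for column in pwm:
        nxt = defaultdict(float)
        for bucket, prob in dist.items():
            for letter in ALPHABET:
                step = round(column[letter] / bucket_width)
                nxt[bucket + step] += prob * background[letter]
        dist = dict(nxt)
    return dist


if __name__ == "__main__":
    # Toy two-column matrix favouring the word "AC".
    pwm = [
        {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
        {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    ]
    uniform = {letter: 0.25 for letter in ALPHABET}
    print(scan(pwm, "TACGAC", threshold=2.0))
    print(bucketed_score_distribution(pwm, uniform, bucket_width=1.0))
```

In this sketch every window in `scan` is scored independently of the others, which is why the problem maps naturally onto one GPU thread per sequence position. In the distribution sketch, rounding shifts each column score by at most half a bucket, so a coarser `bucket_width` trades accuracy for a smaller table.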
