Parallel Position Weight Matrices Algorithms

Position Weight Matrices (PWMs) are widely used in computational biology. The basic problem, SCAN, is to find the occurrences of a given PWM in large sequences. Several other PWM tasks share a common NP-hard subproblem, SCOREDISTRIBUTION. Existing algorithms rely on enumerating a large set of scores or words, and most of them are not suitable for parallelization. We propose a new algorithm, BUCKETSCOREDISTRIBUTION, that is both very efficient and suitable for parallelization, and we bound the error induced by this algorithm. We implemented a GPU prototype of SCAN and BUCKETSCOREDISTRIBUTION with the CUDA libraries, and report speedups of 21x and 77x, depending on the problem, on an Nvidia GTX 280.

Jean-Stéphane Varré | Mathieu Giraud
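To make the two problems named in the abstract concrete, the following is a minimal sequential sketch in Python. It is not the authors' implementation and not their GPU code. The names `scan`, `bucket_score_distribution`, `background`, and `bucket_width` are hypothetical, and the second function is a generic dynamic program over rounded scores, assumed here only as an illustration of bucketing; the paper's BUCKETSCOREDISTRIBUTION may differ in its details. A PWM is represented as a list of columns, each mapping a letter to its score.

```python
from collections import defaultdict


def scan(pwm, sequence, threshold):
    """Return the start positions whose window score reaches the threshold.

    The score of a window is the sum, over the PWM columns, of the score
    assigned to the letter found at that position.
    """
    m = len(pwm)
    hits = []
    for i in range(len(sequence) - m + 1):
        score = sum(pwm[j][sequence[i + j]] for j in range(m))
        if score >= threshold:
            hits.append(i)
    return hits


def bucket_score_distribution(pwm, background, bucket_width):
    """Approximate the distribution of window scores under a background model.

    Illustrative assumption: each column score is rounded to a multiple of
    bucket_width, so the table is indexed by integer buckets instead of by
    every distinct real-valued score. Returns a dict mapping a bucket index
    to the probability that a random word falls in that bucket.
    """
    dist = {0: 1.0}  # the empty word has score 0 with probability 1
    for column in pwm:
        nxt = defaultdict(float)
        for bucket, prob in dist.items():
            for letter, score in column.items():
                shifted = bucket + round(score / bucket_width)
                nxt[shifted] += prob * background[letter]
        dist = dict(nxt)
    return dist
```

In this sketch each column contributes a rounding error of at most half a bucket, so the score attached to a word is off by at most `len(pwm) * bucket_width / 2`. The inner loops of both functions are independent across sequence positions and across buckets, which is the kind of structure a GPU implementation could exploit.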
