Paper information - A fast CUDA implementation of agrep algorithm for approximate nucleotide sequence matching

The availability of huge amounts of nucleotide sequence data has catalyzed the development of fast algorithms for approximate DNA and RNA string matching. However, most existing online algorithms can handle only small-scale problems, and their performance becomes unacceptable when querying large genomes. Offline algorithms such as Bowtie and BWA require building indexes, and their memory requirements are high. We have developed a fast CUDA implementation of the agrep algorithm for approximate nucleotide sequence matching that exploits the computational power of modern GPU hardware. Our CUDA program can search large genomes for patterns of length up to 64 with edit distance up to 9. For example, it searches the entire human genome (3.10 Gbp in 24 chromosomes) for patterns of length 30 and 60 with edit distances of 3 and 6 in 371 and 1,188 milliseconds, respectively, on one NVIDIA GeForce GTX285 graphics card, achieving 70-fold and 36-fold speedups over a multithreaded quad-core CPU counterpart. Because the program takes an online approach and does not require building indexes of any kind, it can be applied in real time. Using a two-bits-per-character binary representation, its memory requirement is merely one fourth of the original genome size, so multiple genomes can be loaded simultaneously. The x86 and x64 executables for Linux and Windows, C++ source code, documentation, user manual, and an AJAX MVC website for online real-time searching are available at http://agrep.cse.cuhk.edu.hk. Users can also send emails to CUDAagrep@gmail.com to queue up for a job.
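To make the two core ideas of the abstract concrete, the following is a minimal sequential sketch in Python, not the authors' CUDA code. It packs nucleotides two bits per character and then scans the packed text with the bit-parallel recurrence of Wu and Manber that underlies agrep. The A/C/G/T to 0..3 mapping, the function names `pack_2bit`, `unpack_code` and `agrep_search`, and the reading of the 64-character pattern limit as one 64-bit state word are all assumptions made for illustration.

```python
# Assumed mapping: each nucleotide becomes a 2-bit code (hypothetical choice).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_2bit(seq):
    """Pack a nucleotide string into bytes, four characters per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, ch in enumerate(seq):
        out[i // 4] |= CODE[ch] << (2 * (i % 4))
    return bytes(out)


def unpack_code(packed, i):
    """Return the 2-bit code of the i-th nucleotide in the packed text."""
    return (packed[i // 4] >> (2 * (i % 4))) & 3


def agrep_search(packed, n, pattern, k):
    """Return 0-based end positions in the n-character packed text where
    `pattern` matches some substring with edit distance at most k."""
    m = len(pattern)
    # Assumption: the state must fit one 64-bit word, mirroring the stated limit.
    assert 0 < m <= 64 and 0 <= k < m
    full = (1 << m) - 1

    # masks[c] has bit i set when pattern[i] encodes to c.
    masks = [0, 0, 0, 0]
    for i, ch in enumerate(pattern):
        masks[CODE[ch]] |= 1 << i

    # R[d] bit i set: pattern[:i+1] matches a suffix of the text read so far
    # with at most d errors. Before any text, d deletions cover d characters.
    R = [(1 << d) - 1 for d in range(k + 1)]
    hits = []
    for pos in range(n):
        b = masks[unpack_code(packed, pos)]
        prev_old = R[0]                      # R[d-1] before this character
        R[0] = ((R[0] << 1) | 1) & b         # exact extension only
        for d in range(1, k + 1):
            cur_old = R[d]
            R[d] = ((((cur_old << 1) | 1) & b)      # match
                    | prev_old                      # extra character in text
                    | ((prev_old | R[d - 1]) << 1)  # substitution / skipped pattern char
                    | 1) & full
            prev_old = cur_old
        if R[k] & (1 << (m - 1)):
            hits.append(pos)
    return hits


if __name__ == "__main__":
    text = "ACGTACGTTTGACA"
    print(agrep_search(pack_2bit(text), len(text), "ACGT", 0))  # [3, 7]
```

Under this packing, 3.10 Gbp would occupy roughly 3.10e9 / 4 = 775 million bytes, which is consistent with the one-fourth figure above if the original genome is stored at one byte per character. A GPU version would presumably split the text among many threads, each running the same recurrence over its own slice; that partitioning is not shown here.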

Kwong-Sak Leung | Man Hon Wong | Hongjian Li | Bing Ni
