Paper information - PUNAS: A Parallel Ungapped-Alignment-Featured Seed Verification Algorithm for Next-Generation Sequencing Read Alignment


The progress of next-generation sequencing has a major impact on medical and genomic research. This technology can now produce billions of short DNA fragments (reads) in a single run. One of the most demanding computational problems faced by almost every sequencing pipeline is short-read alignment, i.e., determining where each fragment originated in the original genome. Most current solutions are based on a seed-and-extend approach, in which promising candidate regions (seeds) are first identified and then extended to verify whether a full high-scoring alignment actually exists in the vicinity of each seed. Seed verification is the main bottleneck in many state-of-the-art aligners, so finding fast solutions is of high importance. We present a parallel ungapped-alignment-featured seed verification (PUNAS) algorithm, a fast filter that effectively removes the majority of false-positive seeds and thus significantly accelerates the short-read alignment process. PUNAS is based on bit-parallelism and takes advantage of the SIMD vector units of modern microprocessors. Our implementation employs a vectorize-and-scale approach supporting multi-core CPUs and many-core Knights Landing (KNL)-based Xeon Phi processors. Performance evaluation reveals that PUNAS is over three orders of magnitude faster than seed verification with the Smith-Waterman algorithm and around one order of magnitude faster than seed verification with the banded version of Myers' bit-vector algorithm. Using a single thread, it achieves speedups of up to 7.3, 27.1, and 11.6 over the shifted Hamming distance filter on an SSE-, AVX2-, and AVX-512-based CPU/KNL, respectively. The speed of our framework further scales almost linearly with the number of cores. PUNAS is open-source software available at https://github.com/Xu-Kai/PUNASfilter.
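To make the idea of a bit-parallel ungapped filter more concrete, the following is a minimal Python sketch. It is not the PUNAS algorithm and is not taken from the PUNAS source code; it only illustrates the general principle of packing bases into machine words and counting mismatches with bitwise operations, so that a seed can be rejected cheaply before any expensive gapped alignment. The 2-bit encoding, the function names (`pack`, `mismatches`, `passes_filter`), and the `max_mismatches` threshold are all assumptions made for illustration.

```python
# Illustrative sketch only: a bit-parallel ungapped (Hamming-style) seed check.
# This is NOT the PUNAS algorithm; every name below is hypothetical.

# Assumed 2-bit encoding of the four DNA bases.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq: str) -> int:
    """Pack a DNA string into one integer, 2 bits per base (first base in the low bits)."""
    word = 0
    for i, base in enumerate(seq):
        word |= CODE[base] << (2 * i)
    return word


def mismatches(read: str, window: str) -> int:
    """Count mismatching positions of two equal-length sequences with bitwise operations."""
    assert len(read) == len(window)
    n = len(read)
    diff = pack(read) ^ pack(window)            # a nonzero 2-bit group marks a mismatch
    low_mask = int("01" * n, 2) if n else 0     # selects the low bit of every 2-bit group
    per_base = (diff | (diff >> 1)) & low_mask  # collapse each group to a single flag bit
    return bin(per_base).count("1")             # population count of the flag bits


def passes_filter(read: str, reference: str, pos: int, max_mismatches: int) -> bool:
    """Keep a seed at `pos` only if the ungapped comparison stays within the threshold."""
    window = reference[pos:pos + len(read)]
    if len(window) < len(read):                 # seed too close to the end of the reference
        return False
    return mismatches(read, window) <= max_mismatches


if __name__ == "__main__":
    reference = "TTACGATT"
    print(passes_filter("ACGT", reference, 2, 1))  # window "ACGA" differs in one base
    print(passes_filter("ACGT", reference, 2, 0))  # no mismatches allowed
```

In a native implementation the same XOR, mask, and population-count steps would operate on fixed-width registers, which is where SIMD units such as SSE, AVX2, or AVX-512 could process many bases per instruction; the sketch uses Python's arbitrary-precision integers purely for readability.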
