BGSA: a bit-parallel global sequence alignment toolkit for multi-core and many-core architectures

Motivation: Modern bioinformatics tools for analyzing large-scale NGS datasets often need fast implementations of core sequence alignment algorithms to achieve reasonable execution times. We address this need by presenting the BGSA toolkit, which provides optimized implementations of popular bit-parallel global pairwise alignment algorithms on modern microprocessors.

Results: For pairwise alignments of a batch of sequence reads, BGSA outperforms Edlib, SeqAn, and BitPAl for edit distance computations, and Parasail, SeqAn, and BitPAl when using more general scoring schemes, on both standard multi-core CPUs and Xeon Phi many-core CPUs. Furthermore, for the seed verification stage of a read mapper, the banded edit distance computation of BGSA on a Xeon Phi-7210 outperforms the highly optimized NVBio implementation on a Titan X GPU by a factor of 4.4.

Availability: BGSA is open-source and available at https://github.com/sdu-hpcl/BGSA.

Supplementary information: Supplementary data are available at Bioinformatics online.
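To illustrate what "bit-parallel" means for global edit distance, the following is a minimal sketch of the bit-vector recurrence in the style of Myers and Hyyrö. It is not BGSA's implementation: the function name `bit_parallel_edit_distance` is hypothetical, and Python's arbitrary-precision integers stand in for the fixed-width machine words and SIMD registers that an optimized toolkit would use. The vertical and horizontal differences of one dynamic programming column are packed into bit-vectors, so each text character updates the whole column with a handful of bitwise operations and one addition.

```python
def bit_parallel_edit_distance(pattern: str, text: str) -> int:
    """Global (Levenshtein) edit distance via a bit-vector recurrence.

    Illustrative sketch only; a Python int plays the role of one wide word.
    """
    m = len(pattern)
    if m == 0:
        return len(text)

    mask = (1 << m) - 1          # keeps every vector to m bits
    top = 1 << (m - 1)           # bit of the last pattern row

    # peq[c] has bit i set when pattern[i] == c.
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv = mask                    # vertical deltas of +1 (first column is 0..m)
    mv = 0                       # vertical deltas of -1
    score = m                    # D[m][0]

    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal deltas of +1
        mh = pv & xh                    # horizontal deltas of -1

        # The horizontal delta in the last row updates the running score.
        if ph & top:
            score += 1
        elif mh & top:
            score -= 1

        # Shift in +1 at row 0: for global alignment D[0][j] = j.
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv

    return score
```

Calling `bit_parallel_edit_distance("kitten", "sitting")` gives the distance between the two strings. Running many such independent pairs side by side across cores and vector lanes is the kind of batch workload the abstract describes.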
