Acceleration of Nucleotide Semi-Global Alignment with Adaptive Banded Dynamic Programming

Motivation: Pairwise alignment of nucleotide sequences is, in practice, computed with the seed-and-extend strategy: seeds (shared patterns) between the sequences are enumerated, and the seeds are then extended by a Smith-Waterman-like semi-global dynamic programming to obtain full pairwise alignments. With the advent of massively parallel short-read sequencers, algorithms and data structures for efficiently finding seeds have been explored extensively. However, recent advances in single-molecule sequencing technologies have made it possible to obtain millions of reads, each of which is orders of magnitude longer than those output by short-read sequencers. This demands a faster algorithm for the extension step, which dominates the computation time in pairwise local alignment. Our goal is to design a faster extension algorithm that copes with two major drawbacks of single-molecule sequencers: the sequencing error rate is high (e.g., 10-15%), and insertions and deletions are more frequent than substitutions.

Results: We propose an adaptive banded dynamic programming (DP) algorithm for calculating pairwise semi-global alignment of nucleotide sequences. It tolerates a relatively high insertion or deletion rate while keeping the band width at a small constant (e.g., 32 cells). On every band-advancing operation, the cells at the forefront of the band are calculated simultaneously without mutual dependencies, which allows efficient Single-Instruction-Multiple-Data (SIMD) parallelization. We show experimentally that our algorithm runs approximately 8 times faster than the extension alignment algorithm in NCBI BLAST+ while retaining similar sensitivity and accuracy. The results indicate that the algorithm is capable of replacing the extension alignment routines in existing nucleotide local alignment programs.

Availability: The implementation of the algorithm and the benchmarking scripts are available at https://github.com/ocxtal/adaptivebandbench.

Contact: mkasa@k.u-tokyo.ac.jp
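The band-advancing idea described in the Results can be illustrated with a minimal Python sketch. This is not the authors' implementation (that is in the repository above); it is a simplified model under several assumptions that do not come from the abstract: a linear gap penalty, the illustrative scores `match=1`, `mismatch=-1`, `gap=-1`, a dictionary in place of SIMD vectors, and a tie-breaking rule that alternates directions. The function name `adaptive_banded_extend` is hypothetical. The band lies along an anti-diagonal of the DP matrix, and at each step it moves one cell down or one cell right depending on which end of the band currently holds the higher score.

```python
NEG_INF = float("-inf")


def adaptive_banded_extend(a, b, width=32, match=1, mismatch=-1, gap=-1):
    """Best extension score from (0, 0) inside a fixed-width adaptive band.

    Returns (best_score, (i, j)), meaning a[:i] is aligned to b[:j].
    Scoring values and the tie-breaking rule are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # The band on an anti-diagonal d (i + j == d) holds rows lo .. lo + width - 1.
    lo = -(width // 2)
    prev2 = {}               # scores on anti-diagonal d - 2
    prev = {(0, 0): 0}       # scores on anti-diagonal d - 1
    best, best_end = 0, (0, 0)

    for d in range(1, n + m + 1):
        # Compare the two end cells of the previous band to pick a direction.
        hi = lo + width - 1
        top = prev.get((lo, d - 1 - lo), NEG_INF)     # upper-right end
        bottom = prev.get((hi, d - 1 - hi), NEG_INF)  # lower-left end
        if bottom > top or (bottom == top and d % 2 == 1):
            lo += 1          # advance the band down (towards larger i)
        # Otherwise advance right: same rows, one column further.

        cur = {}
        # Every cell here reads only the two previous anti-diagonals, so the
        # iterations are independent (the part a SIMD version vectorizes).
        for i in range(max(lo, 0), min(lo + width, n + 1)):
            j = d - i
            if j < 0 or j > m:
                continue     # outside the DP matrix
            score = max(prev.get((i - 1, j), NEG_INF) + gap,
                        prev.get((i, j - 1), NEG_INF) + gap)
            if i > 0 and j > 0:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                score = max(score, prev2.get((i - 1, j - 1), NEG_INF) + sub)
            if score == NEG_INF:
                continue     # unreachable from inside the band
            cur[(i, j)] = score
            if score > best:
                best, best_end = score, (i, j)
        prev2, prev = prev, cur

    return best, best_end
```

Calling `adaptive_banded_extend(read, reference, width=32)` returns the best score found and the coordinates where that alignment ends. Because cells outside the band are treated as unreachable, the result is a lower bound on the full DP score; a practical implementation would presumably keep each anti-diagonal in a fixed-size vector and update all lanes at once rather than looping over a dictionary.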
