Introducing difference recurrence relations for faster semi-global alignment of long sequences

Background: The read length of single-molecule DNA sequencers is reaching 1 Mb. Popular alignment tools for such long reads often use single-instruction multiple-data (SIMD) operations to accelerate the calculation of dynamic programming (DP) matrices in the Smith–Waterman–Gotoh (SWG) algorithm with the alignment start position fixed at the origin. However, 16-bit or 32-bit integers are needed to store the values in a DP matrix when the sequences to be aligned are long, which prevents these tools from using the full SIMD width of modern processors.

Results: We propose a faster semi-global alignment algorithm based on “difference recurrence relations” that runs 2.1 times faster than the state-of-the-art algorithm. Instead of calculating and storing all the values in a DP matrix directly, our algorithm computes and stores mainly the differences between the values of adjacent cells in the matrix. The SWG algorithm and our algorithm output exactly the same result, but our algorithm mainly involves 8-bit integer operations, which lets it exploit the full width of SIMD operations (e.g., 32 lanes) on modern processors. We also developed a library, libgaba, so that developers can easily integrate our algorithm into alignment programs.

Conclusions: Our novel algorithm and optimized library implementation will help accelerate nucleotide long-read analysis algorithms that include pairwise alignment stages. The library is implemented in the C programming language and is available at https://github.com/ocxtal/libgaba.
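The premise behind storing differences can be checked with a small scalar sketch. The code below is not libgaba's API or its SIMD implementation, and it does not implement the difference recurrences themselves; it only fills an ordinary SWG matrix with the start fixed at the origin and records how far the absolute cell values and the adjacent-cell differences range. The scoring parameters and the names `dp_range`, `scan_dp`, and `fits_int8` are assumptions made for illustration. Under these assumptions the differences stay within [-(GAP_OPEN + GAP_EXT), MATCH + GAP_OPEN + GAP_EXT], whereas the absolute values grow with sequence length.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scoring scheme (an assumption, not taken from the paper):
 * a gap of length k costs GAP_OPEN + k * GAP_EXT. */
enum { MATCH = 2, MISMATCH = -3, GAP_OPEN = 5, GAP_EXT = 1 };

typedef struct {
    int min_score, max_score; /* range of absolute cell values H[i][j] */
    int min_diff, max_diff;   /* range of H[i][j]-H[i][j-1] and H[i][j]-H[i-1][j] */
} dp_range;

static int max2(int x, int y) { return x > y ? x : y; }

static void track(int v, int *lo, int *hi)
{
    if (v < *lo) *lo = v;
    if (v > *hi) *hi = v;
}

/* Fills the Gotoh matrices row by row (start fixed at the origin) and
 * records the value and difference ranges. Returns 0 on success, -1 if
 * memory allocation fails. */
static int scan_dp(const char *a, size_t n, const char *b, size_t m, dp_range *r)
{
    int *h = malloc((m + 1) * sizeof *h); /* H of the previous row, updated in place */
    int *f = malloc((m + 1) * sizeof *f); /* F (vertical gap state) per column */
    if (!h || !f) { free(h); free(f); return -1; }

    r->min_score = r->max_score = 0;      /* H[0][0] = 0 */
    r->min_diff = INT_MAX;
    r->max_diff = INT_MIN;

    h[0] = 0;
    f[0] = INT_MIN / 2;
    for (size_t j = 1; j <= m; j++) {     /* row 0: one horizontal gap from the origin */
        h[j] = -GAP_OPEN - (int)j * GAP_EXT;
        f[j] = INT_MIN / 2;               /* no vertical gap can end in row 0 */
        track(h[j], &r->min_score, &r->max_score);
        track(h[j] - h[j - 1], &r->min_diff, &r->max_diff);
    }
    for (size_t i = 1; i <= n; i++) {
        int diag = h[0];                            /* H[i-1][0] */
        int left = -GAP_OPEN - (int)i * GAP_EXT;    /* H[i][0] */
        int e = INT_MIN / 2;                        /* E (horizontal gap state) */
        track(left, &r->min_score, &r->max_score);
        track(left - h[0], &r->min_diff, &r->max_diff);
        h[0] = left;
        for (size_t j = 1; j <= m; j++) {
            int up = h[j];                          /* H[i-1][j] */
            int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
            e = max2(left - GAP_OPEN - GAP_EXT, e - GAP_EXT);
            f[j] = max2(up - GAP_OPEN - GAP_EXT, f[j] - GAP_EXT);
            int cur = max2(diag + s, max2(e, f[j]));
            track(cur, &r->min_score, &r->max_score);
            track(cur - left, &r->min_diff, &r->max_diff); /* horizontal difference */
            track(cur - up, &r->min_diff, &r->max_diff);   /* vertical difference */
            diag = up;
            left = cur;
            h[j] = cur;
        }
    }
    free(h);
    free(f);
    return 0;
}

/* True if every recorded adjacent-cell difference fits in a signed 8-bit lane. */
static int fits_int8(const dp_range *r)
{
    return r->min_diff >= INT8_MIN && r->max_diff <= INT8_MAX;
}
```

Calling `scan_dp` on two identical 300-base sequences gives a maximum cell value of 600, which already overflows a signed 8-bit integer, while `fits_int8` still holds for the differences. This is the property that lets a difference-based formulation keep its working values in 8-bit SIMD lanes regardless of sequence length.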
