Libgapmis: An ultrafast library for short-read single-gap alignment

A broad variety of short-read alignment programmes has been released recently to address the task of mapping tens of millions of short reads to a reference genome, each placing emphasis on different aspects of the problem. Although all programmes allow for a small number of alignment mismatches, some either perform poorly when gap insertions are allowed or do not allow gap insertions at all. Most of these programmes apply the seed-and-extend strategy: after a fast alignment between a fragment of the reference sequence and a high-quality fragment of a short read (the seed), an important problem is to extend the alignment between a relatively short succeeding fragment of the reference sequence and the remaining low-quality fragment of the read, allowing a number of mismatches and the insertion of gaps in the alignment. However, the length of the short reads, in combination with the gap occurrence frequency observed in various applications, suggests that the single-gap alignment of (parts of) those reads is desirable. In this article, we present libgapmis, an ultrafast library for pairwise short-read single-gap alignment, including accelerated SSE-based and GPU-based versions. It implements an algorithm that computes a modified version of the traditional dynamic programming matrix for sequence alignment to solve the above alignment problem. We show that the library functions of the CPU-based version are up to 20x faster than competing programmes, while the SSE-based and GPU-based versions are up to 6x and 11x faster than our CPU-based implementation, respectively. The functions made available via our library can be seamlessly integrated into any short-read alignment pipeline.
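To make the single-gap alignment problem concrete, the following is a minimal sketch of a simplified variant: a global alignment of two sequences in which all length difference is absorbed by one contiguous gap in the shorter sequence, and every other column is a match or a mismatch. It is not the libgapmis API, nor the modified dynamic programming matrix described above; the function name `single_gap_align` and the scoring parameters (`mismatch_cost`, `gap_open`, `gap_extend`) are assumptions chosen purely for illustration.

```python
def single_gap_align(x, y, mismatch_cost=1, gap_open=2, gap_extend=1):
    """Illustrative single-gap global alignment (not the libgapmis algorithm).

    One gap of length g = |len(x) - len(y)| is placed in the shorter
    sequence before position p; p is chosen to minimise mismatches.
    Returns (total_cost, p, g). All scoring values are assumed defaults.
    """
    short, long_ = (x, y) if len(x) <= len(y) else (y, x)
    m = len(short)
    g = len(long_) - m

    # pre[i]: mismatches between short[:i] and long_[:i] (left of the gap)
    pre = [0] * (m + 1)
    for i in range(m):
        pre[i + 1] = pre[i] + (short[i] != long_[i])

    # suf[i]: mismatches between short[i:] and long_[i + g:] (right of the gap)
    suf = [0] * (m + 1)
    for i in range(m - 1, -1, -1):
        suf[i] = suf[i + 1] + (short[i] != long_[i + g])

    # Try every gap position; ties resolve to the leftmost position.
    best_p = min(range(m + 1), key=lambda p: pre[p] + suf[p])
    mismatches = pre[best_p] + suf[best_p]

    gap_cost = gap_open + gap_extend * g if g else 0
    return mismatches * mismatch_cost + gap_cost, best_p, g


if __name__ == "__main__":
    # The read lacks the two bases "GG" present in the reference fragment.
    print(single_gap_align("AAAACCCC", "AAAAGGCCCC"))
```

Because the gap length is fixed by the two sequence lengths in this variant, prefix and suffix mismatch counts are enough to evaluate every gap position in linear time. The problem addressed by the library is more general, since the extension step aligns a read fragment against a slightly longer fragment of the reference.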
