Long read alignment based on maximal exact match seeds

Motivation: The explosive growth of next-generation sequencing datasets poses a challenge to the mapping of reads to reference genomes in terms of alignment quality and execution speed. With the continuing progress of high-throughput sequencing technologies, read length is constantly increasing and many existing aligners are becoming inefficient as generated reads grow larger. Results: We present CUSHAW2, a parallelized, accurate, and memory-efficient long read aligner. Our aligner is based on the seed-and-extend approach and uses maximal exact matches as seeds to find gapped alignments. We have evaluated and compared CUSHAW2 to the three other long read aligners BWA-SW, Bowtie2 and GASSST, by aligning simulated and real datasets to the human genome. The performance evaluation shows that CUSHAW2 is consistently among the highest-ranked aligners in terms of alignment quality for both single-end and paired-end alignment, while demonstrating highly competitive speed. Furthermore, our aligner shows good parallel scalability with respect to the number of CPU threads. Availability: CUSHAW2, written in C++, and all simulated datasets are available at http://cushaw2.sourceforge.net Contact: liuy@uni-mainz.de; bertil.schmidt@uni-mainz.de Supplementary information: Supplementary data are available at Bioinformatics online.

[1]  Yongchao Liu,et al.  CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform , 2012, Bioinform..

[2]  R. Durbin,et al.  Mapping Quality Scores Mapping Short Dna Sequencing Reads and Calling Variants Using P

, 2022 .

[3]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[4]  Richard Durbin,et al.  Fast and accurate long-read alignment with Burrows–Wheeler transform , 2010, Bioinform..

[5]  Jens Stoye,et al.  Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming , 2011, Bioinform..

[6]  Ruiqiang Li,et al.  SOAP: short oligonucleotide alignment program , 2008, Bioinform..

[7]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[8]  Nicholas L. Bray,et al.  AVID: A global alignment program. , 2003, Genome research.

[9]  Siu-Ming Yiu,et al.  SOAP2: an improved ultrafast tool for short read alignment , 2009, Bioinform..

[10]  Hwan-Gue Cho,et al.  GAME: A simple and efficient whole genome alignment method using maximal exact match filtering , 2005, Comput. Biol. Chem..

[11]  J. Mullikin,et al.  SSAHA: a fast search method for large DNA databases. , 2001, Genome research.

[12]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[13]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[14]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[15]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[16]  Enno Ohlebusch,et al.  Efficient multiple genome alignment , 2002, ISMB.

[17]  Michael Brudno,et al.  SHRiMP: Accurate Mapping of Short Color-space Reads , 2009, PLoS Comput. Biol..

[18]  S. Salzberg,et al.  Alignment of whole genomes. , 1999, Nucleic acids research.

[19]  Eugene W. Myers,et al.  Efficient q-Gram Filters for Finding All epsilon-Matches over a Given Length , 2005, RECOMB.

[20]  Siu-Ming Yiu,et al.  Compressed indexing and local alignment of DNA , 2008, Bioinform..

[21]  Christus,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 2022 .

[22]  Torbjørn Rognes,et al.  Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation , 2011, BMC Bioinformatics.

[23]  Meng He,et al.  Indexing Compressed Text , 2003 .

[24]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[25]  Eugene W. Myers,et al.  Efficient q-Gram Filters for Finding All epsilon-Matches over a Given Length , 2006, J. Comput. Biol..

[26]  Bin Ma,et al.  PatternHunter: faster and more sensitive homology search , 2002, Bioinform..

[27]  M. Frith,et al.  Adaptive seeds tame genomic sequence comparison. , 2011, Genome research.

[28]  Dominique Lavenier,et al.  GASSST: global alignment short sequence search tool , 2010, Bioinform..

[29]  S. Nelson,et al.  BFAST: An Alignment Tool for Large Scale Genome Resequencing , 2009, PloS one.