GMAP and GSNAP for Genomic Sequence Alignment: Enhancements to Speed, Accuracy, and Functionality

The programs GMAP and GSNAP, which align RNA-Seq and DNA-Seq datasets to genomes, have evolved along with advances in biological methodology to handle longer reads, larger volumes of data, and new types of biological assays. The genomic representation has been improved to include linear genomes that support sequence comparison using single-instruction multiple-data (SIMD) instructions, compressed genomic hash tables with fast SIMD-based access, support for large genomes of more than four billion bp, and enhanced suffix arrays (ESAs) with novel data structures for fast access. Improvements to the algorithms include a greedy match-and-extend algorithm using suffix arrays, segment chaining using genomic hash tables, diagonalization using segmental hash tables, and nucleotide-level dynamic programming procedures that use SIMD instructions and eliminate the need for F-loop calculations. Enhancements to the functionality of the programs include standardization of indel positions, handling of ambiguous splicing, clipping and merging of overlapping paired-end reads, and alignments to circular chromosomes and alternate scaffolds. The programs have been adapted for use in pipelines by integrating them into R/Bioconductor packages such as gmapR and HTSeqGenie, and these pipelines have facilitated the discovery of numerous biological phenomena.
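To make the idea of a greedy match-and-extend search over a suffix array more concrete, the following is a minimal Python sketch. It is not the GMAP/GSNAP implementation: it uses a plain, naively built suffix array rather than an enhanced suffix array, it ignores SIMD, strands, and mismatch handling, and all function names (`build_suffix_array`, `longest_prefix_match`, `greedy_segments`) are hypothetical names chosen for illustration. It only shows the general pattern of finding the longest exact match for the start of a read, recording it, and continuing from where the match ended.

```python
from typing import Callable, List, Tuple


def build_suffix_array(genome: str) -> List[int]:
    """Naive suffix array construction; suitable for toy inputs only."""
    return sorted(range(len(genome)), key=lambda i: genome[i:])


def _first_true(lo: int, hi: int, pred: Callable[[int], bool]) -> int:
    """Smallest index in [lo, hi) where pred holds (pred is False..True)."""
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


def longest_prefix_match(genome: str, sa: List[int], query: str) -> Tuple[int, int]:
    """Return (genome_pos, length) for the longest prefix of query in genome.

    Returns (-1, 0) when not even the first character matches.
    """
    lo, hi = 0, len(sa)  # suffix-array interval [lo, hi) matching query[:length]
    length = 0

    def char_at(i: int) -> str:
        # Character that follows the already-matched prefix in suffix sa[i];
        # "" sorts before every base, matching suffix-array order.
        pos = sa[i] + length
        return genome[pos] if pos < len(genome) else ""

    while length < len(query):
        c = query[length]
        new_lo = _first_true(lo, hi, lambda i: char_at(i) >= c)
        new_hi = _first_true(new_lo, hi, lambda i: char_at(i) > c)
        if new_lo == new_hi:  # no suffix extends the match by c
            break
        lo, hi = new_lo, new_hi
        length += 1

    return (sa[lo], length) if length > 0 else (-1, 0)


def greedy_segments(genome: str, sa: List[int], read: str) -> List[Tuple[int, int, int]]:
    """Greedily cover the read with exact matches.

    Each segment is (read_offset, genome_pos, length). Restarting exactly at
    the end of the previous match is a simplification made for this sketch.
    """
    segments = []
    offset = 0
    while offset < len(read):
        pos, length = longest_prefix_match(genome, sa, read[offset:])
        if length == 0:
            offset += 1  # base absent from the genome (e.g. N): skip it
            continue
        segments.append((offset, pos, length))
        offset += length
    return segments


# Toy example: the read joins two genome regions separated by a gap.
genome = "GATTACAGGCCTTAGC"
sa = build_suffix_array(genome)
read = "TTACAG" + "CTTAGC"
print(greedy_segments(genome, sa, read))  # [(0, 2, 6), (6, 10, 6)]
```

In this toy case the read splits into two exact segments that map to genome positions 2 and 10, which is the kind of output a later chaining or dynamic programming step could stitch into a gapped or spliced alignment.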
