The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote

Read alignment is an ongoing challenge for the analysis of data from sequencing technologies. This article proposes an elegantly simple multi-seed strategy, called seed-and-vote, for mapping reads to a reference genome. The new strategy chooses the mapped genomic location for the read directly from the seeds. It uses a relatively large number of short seeds (called subreads) extracted from each read and allows all the seeds to vote on the optimal location. When the read length is <160 bp, overlapping subreads are used. More conventional alignment algorithms are then used to fill in detailed mismatch and indel information between the subreads that make up the winning voting block. The strategy is fast because the overall genomic location has already been chosen before the detailed alignment is done. It is sensitive because no individual subread is required to map exactly, nor are individual subreads constrained to map close by other subreads. It is accurate because the final location must be supported by several different subreads. The strategy extends easily to find exon junctions, by locating reads that contain sets of subreads mapping to different exons of the same gene. It scales up efficiently for longer reads.

[1]  Wei Shi,et al.  Global changes in the mammary epigenome are induced by hormonal cues and coordinated by Ezh2. , 2013, Cell reports.

[2]  Roderic Guigó,et al.  The GEM mapper: fast, accurate and versatile alignment by filtration , 2012, Nature Methods.

[3]  Knut Reinert,et al.  RazerS 3: Faster, fully sensitive read mapping , 2012, Bioinform..

[4]  Neil Moore,et al.  Lifestyle transitions in plant pathogenic Colletotrichum fungi deciphered by genome and transcriptome analyses , 2012, Nature Genetics.

[5]  Dario Strbenac,et al.  Copy-number-aware differential analysis of quantitative DNA sequencing data , 2012, Genome research.

[6]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[7]  Leping Li,et al.  ART: a next-generation sequencing read simulator , 2012, Bioinform..

[8]  Davis J. McCarthy,et al.  Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation , 2012, Nucleic acids research.

[9]  Zhenxin Zhang,et al.  A 14 bp indel variation in the NCX1 gene modulates the age at onset in late-onset Alzheimer’s disease , 2012, Journal of Neural Transmission.

[10]  Knut Reinert,et al.  STELLAR: fast and exact local alignments , 2011, BMC Bioinformatics.

[11]  Martha R. Stampfer,et al.  Epigenetic regulation of normal human mammary cell type-specific miRNAs. , 2011, Genome research.

[12]  Lucian Ilie,et al.  SHRiMP2: Sensitive yet Practical Short Read Mapping , 2011, Bioinform..

[13]  M. DePristo,et al.  A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.

[14]  Ping Zhang,et al.  Integrated genome-wide chromatin occupancy and expression analyses identify key myeloid pro-differentiation transcription factors repressed by Myb , 2011, Nucleic acids research.

[15]  Wei-keng Liao,et al.  Anatomy of a hash-based long read sequence mapping algorithm for next generation DNA sequencing , 2011, Bioinform..

[16]  Knut Reinert,et al.  A novel and well-defined benchmarking method for second generation read mapping , 2011, BMC Bioinformatics.

[17]  Wei Shi,et al.  Optimizing the noise versus bias trade-off for Illumina whole genome expression BeadChips , 2010, Nucleic acids research.

[18]  Manuel Holtgrewe,et al.  Mason – A Read Simulator for Second Generation Sequencing Data , 2010 .

[19]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[20]  Heng Li,et al.  A survey of sequence alignment algorithms for next-generation sequencing , 2010, Briefings Bioinform..

[21]  Derek Y. Chiang,et al.  MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery , 2010, Nucleic acids research.

[22]  Dominique Lavenier,et al.  GASSST: global alignment short sequence search tool , 2010, Bioinform..

[23]  B. Langmead,et al.  Cloud-scale RNA-sequencing differential expression analysis with Myrna , 2010, Genome Biology.

[24]  Thomas J. Hardcastle,et al.  baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data , 2010, BMC Bioinformatics.

[25]  Faraz Hach,et al.  mrsFAST: a cache-oblivious algorithm for short-read mapping , 2010, Nature Methods.

[26]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.

[27]  W. Huber,et al.  which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets , 2011 .

[28]  Serban Nacu,et al.  Fast and SNP-tolerant detection of complex variants and splicing in short reads , 2010, Bioinform..

[29]  Richard Durbin,et al.  Fast and accurate long-read alignment with Burrows–Wheeler transform , 2010, Bioinform..

[30]  Mark D. Robinson,et al.  edgeR: a Bioconductor package for differential expression analysis of digital gene expression data , 2009, Bioinform..

[31]  S. Nelson,et al.  BFAST: An Alignment Tool for Large Scale Genome Resequencing , 2009, PloS one.

[32]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[33]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[34]  Lior Pachter,et al.  Sequence Analysis , 2020, Definitions.

[35]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[36]  R. Durbin,et al.  Mapping Quality Scores Mapping Short Dna Sequencing Reads and Calling Variants Using P

, 2022 .

[37]  Bin Ma,et al.  ZOOM! Zillions of oligos mapped , 2008, Bioinform..

[38]  R. Irizarry,et al.  Consolidated strategy for the analysis of microarray spike-in data , 2008, Nucleic acids research.

[39]  Matthew E Ritchie,et al.  Spike-in validation of an Illumina-specific variance-stabilizing transformation , 2008, BMC Research Notes.

[40]  Mark D. Robinson,et al.  Moderated statistical tests for assessing differences in tag abundance , 2007, Bioinform..

[41]  Changqing Zeng,et al.  A six-nucleotide insertion-deletion polymorphism in the CASP8 promoter is associated with susceptibility to multiple cancers , 2007, Nature Genetics.

[42]  Hanlee P. Ji,et al.  The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. , 2006, Nature biotechnology.

[43]  Maqc Consortium The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements , 2006, Nature Biotechnology.

[44]  Kathleen F. Kerr,et al.  The External RNA Controls Consortium: a progress report , 2005, Nature Methods.

[45]  Eugene W. Myers,et al.  Efficient q-Gram Filters for Finding All epsilon-Matches over a Given Length , 2005, RECOMB.

[46]  Kim R. Rasmussen,et al.  Efficient q-Gram Filters for Finding All-Matches Over a Given Length , 2005 .

[47]  Rafael A Irizarry,et al.  Exploration, normalization, and summaries of high density oligonucleotide array probe level data. , 2003, Biostatistics.

[48]  Terence P. Speed,et al.  A comparison of normalization methods for high density oligonucleotide array data based on variance and bias , 2003, Bioinform..

[49]  Martin Vingron,et al.  q-gram based database searching using a suffix array (QUASAR) , 1999, RECOMB.

[50]  Eugene W. Myers,et al.  A fast bit-vector algorithm for approximate string matching based on dynamic programming , 1998, JACM.

[51]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[52]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[53]  Christus,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 2022 .