Benchmarking short sequence mapping tools

Background: The development of next-generation sequencing instruments has led to the generation of millions of short sequences in a single run. Aligning these reads to a reference genome is time consuming and demands fast and accurate alignment tools. However, the currently proposed tools make different compromises between the accuracy and the speed of mapping. Moreover, many important aspects are overlooked when the performance of a newly developed tool is compared to the state of the art. Therefore, there is a need for an objective evaluation method that covers all of these aspects. In this work, we introduce a benchmarking suite to extensively analyze sequence mapping tools with respect to various aspects and to provide an objective comparison.

Results: We applied our benchmarking tests to nine well-known mapping tools, namely Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, and mrsFAST (mrFAST), using synthetic data and real RNA-Seq data. MAQ and RMAP are based on building hash tables for the reads, whereas the remaining tools are based on indexing the reference genome. The benchmarking tests reveal the strengths and weaknesses of each tool. The results show that no single tool outperforms all others in all metrics. However, Bowtie maintained the best throughput for most of the tests, while BWA performed better for longer read lengths. The benchmarking tests are not restricted to the mentioned tools and can be applied to others.

Conclusion: The mapping process is still a hard problem that is affected by many factors. In this work, we provided a benchmarking suite that reveals and evaluates the different factors affecting the mapping process. Still, no tool outperforms all of the others in all of the tests. Therefore, end users should clearly specify their needs in order to choose the tool that provides the best results.
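As a rough illustration of the trade-off between accuracy and speed on synthetic data, the sketch below scores one mapper run against the known origin of each simulated read. It is not the benchmarking suite itself: the `MappedRead` record, the `evaluate` function, the `tolerance` parameter, and the three metric definitions are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MappedRead:
    """One simulated read (hypothetical record layout for this sketch)."""
    read_id: str
    true_pos: int               # position the simulator drew the read from
    mapped_pos: Optional[int]   # position reported by the mapper, None if unmapped


def evaluate(reads: List[MappedRead], wall_seconds: float, tolerance: int = 0) -> dict:
    """Summarize one mapper run with assumed, simplified metrics."""
    total = len(reads)
    mapped = [r for r in reads if r.mapped_pos is not None]
    # A read counts as correct if it lands within `tolerance` bases of its origin.
    correct = sum(1 for r in mapped if abs(r.mapped_pos - r.true_pos) <= tolerance)
    return {
        "mapped_fraction": len(mapped) / total if total else 0.0,
        "correct_fraction": correct / total if total else 0.0,
        "throughput_reads_per_s": total / wall_seconds if wall_seconds > 0 else 0.0,
    }


# Toy input: two correct reads, one misplaced read, one unmapped read.
example_reads = [
    MappedRead("r1", 100, 100),
    MappedRead("r2", 250, 250),
    MappedRead("r3", 400, 9000),
    MappedRead("r4", 775, None),
]
example_summary = evaluate(example_reads, wall_seconds=2.0)
```

Reporting the mapped fraction, the correct fraction, and the throughput side by side makes it visible when a tool gains speed by leaving reads unmapped or by placing them incorrectly.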
