A Review on Sequence Alignment Algorithms for Short Reads Based on Next-Generation Sequencing

Recent advances in next-generation sequencing (NGS) technology have produced large volumes of data in the form of short reads. Sequence assembly uses the initial short reads to build progressively longer contigs, and then uses scaffolds to produce the final sequence. Each of these steps requires evaluating the extent of homology between different sequences. However, because the NGS platforms under development are diverse and the data they produce differ in size and read length, numerous algorithms, each with its own methodology, are being developed to process these complex data. It is difficult for biologists to handle the many different features of these algorithms. Therefore, to reduce experimental trial and error, a strategy is needed for choosing the most suitable algorithm according to its performance and purpose, which in turn requires understanding the algorithms' methodologies and using their features effectively. This study reviews the short-read alignment algorithms and NGS platforms developed to date, in order to aid efficient selection of algorithms for mapping DNA data to reference sequences.
