Next-Generation Sequence Assembly: Four Stages of Data Processing and Computational Challenges

Decoding DNA symbols using next-generation sequencers was a major breakthrough in genomic research. Despite the many advantages of next-generation sequencers, e.g., the high-throughput sequencing rate and relatively low cost of sequencing, the assembly of the reads produced by these sequencers remains a major challenge. In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four basic stages: preprocessing filtering, a graph construction process, a graph simplification process, and postprocessing filtering. We discuss these as a four-stage framework for data analysis and processing, and we survey a variety of techniques, algorithms, and software tools used during each stage. We also discuss the challenges that current assemblers face in the next-generation environment in order to determine the current state of the art. We recommend a layered architecture approach for constructing a general assembler that can handle the sequences generated by different sequencing platforms.
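To make the four stages concrete, the following is a minimal sketch in Python of how they might chain together on toy data. It is an illustration under stated assumptions, not the method of any assembler surveyed here: the function names (`preprocess`, `build_de_bruijn`, `simplify`, `postprocess`, `assemble`) and the parameters `k` and `min_len` are hypothetical, a de Bruijn graph is assumed for the graph stages, and reverse complements, coverage, quality values, and paired reads are ignored.

```python
from collections import defaultdict


def preprocess(reads):
    """Stage 1 (toy filter): keep only non-empty reads made of A, C, G, T."""
    return [r for r in reads if r and set(r) <= set("ACGT")]


def build_de_bruijn(reads, k):
    """Stage 2: nodes are (k-1)-mers; each distinct k-mer contributes one edge."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            u, v = kmer[:-1], kmer[1:]
            succ[u].add(v)
            pred[v].add(u)
    return succ, pred


def simplify(succ, pred):
    """Stage 3: merge non-branching paths into contigs.

    Simplification here is path compression only; isolated cycles whose
    nodes all have one predecessor and one successor are not reported.
    """
    def is_inner(node):
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    contigs = []
    for u in sorted(set(succ) | set(pred)):
        if is_inner(u):
            continue  # paths start only at branching or terminal nodes
        for v in sorted(succ.get(u, ())):
            path = u + v[-1]
            while is_inner(v):
                (v,) = succ[v]  # the single successor of an inner node
                path += v[-1]
            contigs.append(path)
    return contigs


def postprocess(contigs, min_len):
    """Stage 4 (toy filter): discard contigs shorter than min_len."""
    return [c for c in contigs if len(c) >= min_len]


def assemble(reads, k, min_len=0):
    """Chain the four stages in order."""
    succ, pred = build_de_bruijn(preprocess(reads), k)
    return postprocess(simplify(succ, pred), min_len)


if __name__ == "__main__":
    toy_reads = ["ATGGCGT", "GGCGTGC", "CGTGCA", "ATGNCGT"]
    print(assemble(toy_reads, k=4, min_len=5))
```

Keeping each stage behind its own function mirrors the layered view: a platform-specific filter or a different graph model could be substituted in one stage without changing the others.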
