Assembly and Data Quality

This chapter describes methods for assembling sequence reads into larger pieces. In many cases, the raw output of a sequencing machine consists of images, which a subsequent analysis step (base calling) translates into sequence reads. Each position of a read receives a quality score that indicates the probability of a sequencing error at that position. After quality filtering and trimming of adapter regions or barcoding indices, the reads can be assembled de novo. Three basic types of assembly strategy are in use: greedy algorithms, overlap-layout-consensus assemblers, and methods relying on k-mer graphs. Contiguous sequences built from overlapping reads are called contigs. Positional information from paired-end reads or mate pairs can be used to order contigs into scaffolds. In the ideal case of genome sequencing, the number of scaffolds would equal the number of expected chromosomes. Several statistics can be used to describe or compare sequence assemblies. In general, a range of programs and parameter settings should be explored to find the best assembly. Different strategies are used for genome, transcriptome, and metagenome assemblies, and all of them benefit greatly from the inclusion of long reads. Assembly methods are becoming an increasingly important tool for everybody working with sequence data, since the vast majority of published sequence data in NCBI GenBank are deposited as short reads in the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra/). These data are usually not directly searchable by methods such as BLAST and need to be assembled before further analysis.
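
Two of the quantities mentioned above can be illustrated with a minimal Python sketch. It assumes that quality scores follow the Phred convention (Q = -10 log10 P) and are stored as Sanger-style FASTQ characters with an ASCII offset of 33, and it uses N50 as one example of an assembly statistic; the abstract does not name a specific statistic, so N50 is chosen here only for illustration. All function names are hypothetical.

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_fastq_quality(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into integer scores.

    Assumes the Sanger encoding (ASCII offset 33); older Illumina
    variants used an offset of 64.
    """
    return [ord(char) - offset for char in qual]


def n50(lengths: list[int]) -> int:
    """Return the N50 of a set of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    scores = decode_fastq_quality("I5!")
    print(scores)                                   # per-base Phred scores
    print([phred_to_error_prob(q) for q in scores])  # per-base error probabilities
    print(n50([2, 3, 4, 5, 6]))                     # N50 of a toy assembly
```

A score of 20 thus corresponds to an expected error rate of 1 in 100 base calls, and a score of 40 to 1 in 10,000. N50 alone says nothing about correctness, so it is best read alongside other measures when comparing assemblies.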
