Paper information - Assembly and Data Quality

Assembly and Data Quality

Methods for assembling sequence reads into larger pieces are described. In many cases, the raw output of a sequencing machine is a set of images, which a subsequent analysis step (base calling) translates into sequence reads. Each position of a sequence read receives a quality score indicating the probability of a sequencing error. After quality filtering and trimming of adapter regions or barcoding indices, the reads can be assembled de novo into larger pieces. Three main types of assembly strategy are in use: greedy algorithms, overlap-layout-consensus assemblers, and methods relying on k-mer graphs. Overlapping reads are merged into contiguous sequences called contigs. Positional information from paired-end reads or mate pairs can then be used to order contigs into scaffolds. In the ideal case of genome sequencing, the number of scaffolds would equal the number of expected chromosomes. Several statistics can be used to describe or compare sequence assemblies. In general, a range of programs and parameter settings should be explored to find the best assembly. Different strategies are used for genome, transcriptome, and metagenome assemblies, and all of them benefit greatly from the inclusion of long reads. Assembly methods are becoming increasingly important for everybody working with sequence data, since the vast majority of published sequence data in NCBI GenBank is deposited as short reads in the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra/). These data are usually not directly searchable by methods like BLAST and need to be assembled for subsequent analysis.
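Two of the ideas above, per-base quality scores and assembly statistics, can be illustrated with a minimal Python sketch. The abstract does not name a scoring scale or a specific statistic, so the following are assumptions: quality scores are taken to follow the common Phred convention (Q = -10 log10 P) with the Sanger FASTQ ASCII offset of 33, and N50 is used as one example of a widely reported assembly statistic. The function names are hypothetical and chosen for illustration only.

```python
from typing import List


def decode_fastq_quality(quality_line: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into integer scores.

    Assumes the Sanger encoding (ASCII offset 33); other FASTQ
    variants use a different offset.
    """
    return [ord(ch) - offset for ch in quality_line]


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def n50(contig_lengths: List[int]) -> int:
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 requires at least one contig")


if __name__ == "__main__":
    scores = decode_fastq_quality("I5!")
    for q in scores:
        print(q, phred_to_error_prob(q))
    print("N50:", n50([100, 200, 300, 400]))
```

Under these assumptions, a score of 20 corresponds to an expected error rate of 1 in 100 base calls. N50 on its own says nothing about correctness, which is one reason to compare several statistics across assemblies.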

Christoph Bleidorn
