Quality control of next-generation sequencing data without a reference

Next-generation sequencing (NGS) technologies have dramatically expanded the breadth of genomics. Genome-scale data, once restricted to a small number of biomedical model organisms, can now be generated for virtually any species at remarkable speed and low cost. Yet non-model organisms often lack a suitable reference to map sequence reads against, making alignment-based quality control (QC) of NGS data more challenging than in cases where a well-assembled genome is already available. Here we show that by generating a rapid, non-optimized draft assembly of raw reads, it is possible to obtain reliable and informative QC metrics, thus removing the need for a high-quality reference. We use benchmark datasets generated from control samples across a range of genome sizes to illustrate that QC inferences made using draft assemblies are broadly equivalent to those made using a well-established reference, and we describe QC tools routinely used in our production facility to assess the quality of NGS data from non-model organisms.
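As an illustration of the kind of summary that can be computed once reads have been mapped back to a draft assembly, the following minimal Python sketch derives two commonly reported values: the contig N50 of the draft and the fraction of reads that map to it. The abstract does not specify which metrics or tools are used, so the choice of metrics and the function names `n50` and `mapped_fraction` are assumptions made for illustration only. The flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary) follow the SAM format specification.

```python
from typing import Iterable

# SAM flag bits, as defined in the SAM format specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the contig N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def mapped_fraction(sam_flags: Iterable[int]) -> float:
    """Return the fraction of primary alignment records that are mapped.

    Secondary and supplementary records are skipped so that each read
    (or each mate of a pair) is counted once.
    """
    primary = [f for f in sam_flags if not f & (SECONDARY | SUPPLEMENTARY)]
    if not primary:
        return 0.0
    mapped = sum(1 for f in primary if not f & UNMAPPED)
    return mapped / len(primary)


if __name__ == "__main__":
    # Hypothetical draft assembly with six contigs (lengths in bp).
    print("N50:", n50([80, 70, 50, 40, 30, 20]))
    # Hypothetical flags: two mapped reads, two unmapped, one secondary record.
    print("Mapped fraction:", mapped_fraction([0, 16, 4, 4, 256]))
```

In practice the contig lengths would be read from the draft assembly FASTA and the flags from the second column of the aligner's SAM output; a low mapped fraction against a draft built from the same reads may point to contamination or a poor-quality library, though the thresholds are dataset-dependent.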
