Prevention, diagnosis and treatment of high‐throughput sequencing data pathologies

High‐throughput sequencing (HTS) technologies generate millions of sequence reads from DNA/RNA molecules rapidly and cost‐effectively. They enable single‐investigator laboratories to address a variety of ‘omics’ questions in nonmodel organisms, and they have fundamentally changed the way genomic approaches are used to advance biological research. One major challenge posed by HTS is the complexity and difficulty of data quality control (QC). While QC issues associated with sample isolation, library preparation and sequencing are well known and protocols for handling them are widely available, QC of the sequence reads themselves is often overlooked. HTS‐generated reads can contain various errors, biases and artefacts, and identifying and ameliorating them can greatly affect subsequent data analysis. However, a systematic survey of QC procedures for HTS data is still lacking. In this review, we begin by presenting standard ‘health check‐up’ QC procedures recommended for HTS data sets and establishing what ‘healthy’ HTS data look like. We then classify the errors, biases and artefacts present in HTS data into three major types of ‘pathologies’, discuss their causes and symptoms, and illustrate their diagnosis and impact on downstream analyses with examples. We conclude by offering examples of successful ‘treatment’ protocols and recommendations on standard practices and treatment options. Although HTS technologies, and consequently their pathologies, change rapidly, we argue that careful QC of HTS data is an important yet often neglected aspect of their application in molecular ecology, and we lay the groundwork for developing an HTS data QC ‘best practices’ guide.
