NGmerge: merging paired-end reads via novel empirically-derived models of sequencing errors

Background: Advances in Illumina DNA sequencing technology have produced longer paired-end reads that increasingly have sequence overlaps. These reads can be merged into a single read that spans the full length of the original DNA fragment, allowing for error correction and accurate determination of read coverage. Extant merging programs utilize simplistic or unverified models for the selection of bases and quality scores for the overlapping region of merged reads.

Results: We first examined the baseline quality score - error rate relationship using sequence reads derived from PhiX. In contrast to numerous published reports, we found that the quality scores produced by Illumina were not substantially inflated above the theoretical values, once the reference genome was corrected for unreported sequence variants. The PhiX reads were then used to create empirical models of sequencing errors in overlapping regions of paired-end reads, and these models were incorporated into a novel merging program, NGmerge. We demonstrate that NGmerge corrects errors and ambiguous bases better than other merging programs, and that it assigns quality scores for merged bases that accurately reflect the error rates. Our results also show that, contrary to published analyses, the sequencing errors of paired-end reads are not independent.

Conclusions: We provide a free and open-source program, NGmerge, that performs better than existing read merging programs. NGmerge is available on GitHub (https://github.com/harvardinformatics/NGmerge) under the MIT License; it is written in C and supported on Linux.
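The "theoretical values" against which quality scores are compared follow the standard Phred definition, Q = -10 log10(p), where p is the probability that a base call is wrong. The sketch below illustrates that relationship only; it is not NGmerge's empirical model, which the abstract does not specify. All function names are hypothetical, and the FASTQ decoding assumes the Phred+33 (Sanger) offset.

```python
import math

PHRED_OFFSET = 33  # assumption: Sanger / Phred+33 FASTQ encoding


def phred_to_error_prob(q):
    """Theoretical error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score implied by an error probability (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_fastq_quality(qual_string):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in qual_string]


def empirical_quality(mismatches, total):
    """Phred-scaled observed error rate for bases sharing a reported score.

    Returns None when no mismatches were observed, because the score is
    unbounded in that case.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if mismatches == 0:
        return None
    return error_prob_to_phred(mismatches / total)


if __name__ == "__main__":
    # A reported score is inflated when the empirical value is lower.
    reported = 30
    observed = empirical_quality(mismatches=10, total=10_000)
    print(f"reported Q{reported}, empirical Q{observed:.1f}")
```

Comparing the reported score of a group of bases with `empirical_quality` for the same group, using mismatches counted against a trusted reference such as PhiX, is one way to check whether scores are inflated.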
