A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments

BackgroundPCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from “natural” read duplicates that represent independent DNA fragments and therefore, over-estimate the PCR duplication rate for DNA-seq and RNA-seq experiments.ResultsIn this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets which contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45–50% of read duplicates correspond to natural read duplicates likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70–95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression and identified outlier samples with a 2-fold greater PCR duplication rate than other samples.ConclusionsThe method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates.

[1]  Kristen Jepsen,et al.  Biased estimates of clonal evolution and subclonal heterogeneity can arise from PCR duplicates in deep sequencing experiments , 2014, Genome Biology.

[2]  Nancy F. Hansen,et al.  Accurate Whole Human Genome Sequencing using Reversible Terminator Chemistry , 2008, Nature.

[3]  Irina I. Abnizova,et al.  Swift: primary data analysis for the Illumina Solexa sequencing platform , 2009, Bioinform..

[4]  K. Hansen,et al.  Biases in Illumina transcriptome sequencing caused by random hexamer priming , 2010, Nucleic acids research.

[5]  J. Kitzman,et al.  which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Whole exome capture in solution with 3Gbp of data , 2010 .

[6]  M. DePristo,et al.  A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.

[7]  Z. Ning,et al.  Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of GC-biased genomes , 2009, Nature Methods.

[8]  Eivind Hovig,et al.  Performance comparison of four exome capture systems for deep sequencing , 2014, BMC Genomics.

[9]  Timothy Daley,et al.  Predicting the molecular complexity of sequencing libraries , 2013, Nature Methods.

[10]  Orion J. Buske,et al.  iReckon: Simultaneous isoform discovery and abundance estimation from RNA-seq data , 2013, Genome research.

[11]  Pedro G. Ferreira,et al.  Transcriptome and genome sequencing uncovers functional variation in humans , 2013, Nature.

[12]  Andrew C. Adey,et al.  Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition , 2010, Genome Biology.

[13]  Funda Meric-Bernstam,et al.  Bias from removing read duplication in ultra-deep sequencing experiments , 2014, Bioinform..

[14]  Junji Uchida,et al.  High-fidelity target sequencing of individual molecules identified using barcode sequences: de novo detection and absolute quantitation of mutations in plasma cell-free DNA from cancer patients , 2015, DNA research : an international journal for rapid publication of reports on genes and genomes.

[15]  S. Linnarsson,et al.  Counting absolute numbers of molecules using unique molecular identifiers , 2011, Nature Methods.

[16]  Gioele La Manno,et al.  Quantitative single-cell RNA-seq with unique molecular identifiers , 2013, Nature Methods.

[17]  Jeroen F. J. Laros,et al.  Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories , 2013, Nature Biotechnology.

[18]  F. van Nieuwerburgh,et al.  Library construction for next-generation sequencing: overviews and challenges. , 2014, BioTechniques.

[19]  Michael A Quail,et al.  Improved Protocols for the Illumina Genome Analyzer Sequencing System , 2009, Current protocols in human genetics.

[20]  Iraad F Bronner,et al.  Improved Protocols for Illumina Sequencing. , 2013, Current protocols in human genetics.

[21]  T. Fennell,et al.  Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries , 2011, Genome Biology.

[22]  M. Gerstein,et al.  RNA-Seq: a revolutionary tool for transcriptomics , 2009, Nature Reviews Genetics.

[23]  Jussi Taipale,et al.  Counting absolute number of molecules using unique molecular identifiers , 2011 .

[24]  Gabor T. Marth,et al.  A global reference for human genetic variation , 2015, Nature.

[25]  M. von Zastrow,et al.  Membrane traffic in the post-genomic era , 2010, Genome Biology.

[26]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[27]  P. Bickel,et al.  Systematic evaluation of factors influencing ChIP-seq fidelity , 2012, Nature Methods.

[28]  James A. Casbon,et al.  A method for counting PCR template molecules with application to next-generation sequencing , 2011, Nucleic acids research.

[29]  B. Williams,et al.  Mapping and quantifying mammalian transcriptomes by RNA-Seq , 2008, Nature Methods.