Sources of PCR-induced distortions in high-throughput sequencing data sets

PCR permits the exponential and sequence-specific amplification of DNA, even from minute starting quantities, and is a fundamental step in preparing DNA samples for high-throughput sequencing. However, PCR-mediated amplification introduces errors. Here we examine the effects of four important sources of error (bias, stochasticity, template switches and polymerase errors) on sequence representation in low-input next-generation sequencing libraries. We designed a pool of diverse PCR amplicons with a defined structure, and then used Illumina sequencing to search for signatures of each process. We further developed quantitative models for each process, and compared the predictions of these models to our experimental data. We find that PCR stochasticity is the major force skewing sequence representation after amplification of a pool of unique DNA amplicons. Polymerase errors become very common in later cycles of PCR but have little impact on the overall sequence distribution because they are confined to small copy numbers. PCR template switches are rare and confined to low copy numbers. Our results provide a theoretical basis for removing distortions from high-throughput sequencing data. In addition, our findings on PCR stochasticity have particular relevance to quantification of results from single cell sequencing, in which sequences are represented by only one or a few molecules.
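To illustrate why stochasticity can skew representation when every amplicon starts from a single molecule, here is a minimal sketch of PCR as a simple branching process. It is not the model developed in this work: it assumes that each molecule is copied independently with a fixed probability per cycle, and the names `amplify`, `simulate_pool` and `efficiency`, as well as the parameter values, are hypothetical choices for illustration only.

```python
import random
import statistics


def amplify(n0, cycles, efficiency, rng):
    """Return the copy number of one amplicon after `cycles` rounds of PCR.

    Assumption (illustrative only): in every cycle each molecule is
    duplicated independently with probability `efficiency`.
    """
    n = n0
    for _ in range(cycles):
        # Count how many of the n molecules were successfully copied.
        n += sum(1 for _ in range(n) if rng.random() < efficiency)
    return n


def simulate_pool(n_amplicons, cycles, efficiency, seed=0):
    """Amplify a pool of unique amplicons, each starting from one molecule."""
    rng = random.Random(seed)
    return [amplify(1, cycles, efficiency, rng) for _ in range(n_amplicons)]


if __name__ == "__main__":
    # Hypothetical parameters: 200 unique amplicons, 10 cycles, 80% efficiency.
    counts = simulate_pool(n_amplicons=200, cycles=10, efficiency=0.8)
    mean = statistics.mean(counts)
    cv = statistics.pstdev(counts) / mean
    print(f"min={min(counts)} max={max(counts)} mean={mean:.1f} CV={cv:.2f}")
```

Under these assumptions, all amplicons share the same efficiency, so any spread in the final copy numbers comes from chance alone. Most of it is set in the first few cycles, when each amplicon is represented by only one or a few molecules.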
