Detection and Removal of PCR Duplicates in Population Genomic ddRAD Studies by Addition of a Degenerate Base Region (DBR) in Sequencing Adapters

Restriction-site associated DNA sequencing (RAD) has emerged as a powerful marker system for studying genome-wide DNA polymorphisms using next-generation sequencing. A recent technical variant of RAD is double-digest RAD (ddRAD), which uses two restriction enzymes for library preparation. The more flexible and balanced ddRAD protocol allows genomic loci to be analyzed in hundreds of individuals. However, in contrast to paired-end sequencing of traditional RAD libraries, PCR duplicates cannot be detected with ddRAD. This is a concern because duplicates can contribute substantially to read coverage and erroneously inflate the proportion of homozygous loci (allele dropout). Allele dropout can bias population genetic parameter inference and complicate the detection of outlier loci under selection. Here we outline a simple approach to detecting PCR duplicates in ddRAD libraries. Our approach introduces a degenerate base region (DBR, 12,288 unique combinations) into the sequencing adapter. In simulations, we demonstrate that the approach is highly efficient and has a low rate of false positives. In addition, we performed a pilot study to test the approach on six aquatic invertebrates, sequenced on a HiSeq 2500 sequencer. Of the reads in the ddRAD libraries, 33.48% were PCR duplicates, distributed across 19.40% of the loci. A disproportionate number of PCR duplicates was detected in only 4.66% of the loci. While this should not be a concern for general parameter inference, outlier loci detection in particular would be improved by the DBR technique. Because the technique is also easy to apply in other RAD protocols, we suggest that DBRs should generally be included in PCR-based RAD studies.
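The idea of DBR-based duplicate detection can be illustrated with a minimal Python sketch. It is not the pipeline used in the study. It assumes that each read has already been assigned to a sample and a locus and that its DBR sequence has been extracted, and it treats reads sharing the same sample, locus, and DBR as copies of one template molecule. The function names, the tuple layout, and the assumption that all 12,288 DBR variants are equally likely are illustrative choices, not details taken from the abstract.

```python
N_DBR = 12_288  # number of unique DBR combinations stated in the abstract


def mark_pcr_duplicates(reads):
    """Split reads into first occurrences and putative PCR duplicates.

    `reads` is an iterable of (read_id, sample, locus, dbr) tuples
    (a hypothetical layout). Reads sharing sample, locus, and DBR are
    assumed to derive from the same template molecule.
    """
    seen = set()
    unique, duplicates = [], []
    for read_id, sample, locus, dbr in reads:
        key = (sample, locus, dbr)
        if key in seen:
            duplicates.append(read_id)
        else:
            seen.add(key)
            unique.append(read_id)
    return unique, duplicates


def dbr_collision_probability(n_templates, n_dbr=N_DBR):
    """Chance that at least two independent templates share a DBR.

    Birthday-problem estimate, assuming every DBR variant is equally
    likely. Such chance collisions would be flagged as duplicates
    although they are not (false positives).
    """
    p_all_distinct = 1.0
    for i in range(n_templates):
        p_all_distinct *= (n_dbr - i) / n_dbr
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    example = [
        ("r1", "A", "loc1", "ACGTAC"),
        ("r2", "A", "loc1", "ACGTAC"),  # same sample, locus, and DBR as r1
        ("r3", "A", "loc1", "TTGACA"),
        ("r4", "B", "loc1", "ACGTAC"),  # different sample, so kept
    ]
    kept, dups = mark_pcr_duplicates(example)
    print("kept:", kept)
    print("duplicates:", dups)
    print("collision probability for 20 templates:",
          dbr_collision_probability(20))
```

Under these assumptions, the second function gives a rough sense of why a large DBR space keeps false positives low: the chance that independent template molecules at one locus carry the same DBR stays small as long as the number of templates per locus is far below the number of DBR variants.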
