Degenerate adaptor sequences for detecting PCR duplicates in reduced representation sequencing data improve genotype calling accuracy

RAD‐tag is a powerful tool for high‐throughput genotyping. It relies on PCR amplification of the starting material, following enzymatic digestion and sequencing adaptor ligation. Amplification introduces duplicate reads into the data; these arise from the same template molecule and are statistically nonindependent, potentially introducing errors into genotype calling. In shotgun sequencing data, duplicates are removed by filtering reads that start at the same position in the alignment. However, restriction enzymes target specific locations within the genome, so reads from different template molecules also start in the same place, which makes it difficult to estimate the extent of PCR duplication. Here, we introduce a slight change to the Illumina sequencing adaptor chemistry, appending a unique four‐base tag to the first index read, which allows duplicate discrimination in aligned data. This approach was validated on the Illumina MiSeq platform, using double‐digest libraries of ants (Wasmannia auropunctata) and yeast (Saccharomyces cerevisiae) with known genotypes, producing modest though statistically significant gains in the odds of calling a genotype accurately. More importantly, removing duplicates also corrected for the strong sample‐to‐sample variability in genotype calling accuracy seen in the ant samples. For libraries prepared from low‐input, degraded museum bird samples (Mixornis gularis), which had low complexity because they were generated from relatively few starting molecules, adaptor tags show that virtually all of the genotypes were called with inflated confidence as a result of PCR duplicates. Quantification of library complexity by adaptor tagging does not significantly increase the difficulty or cost of the overall workflow, but it corrects for differences in quality between samples and permits analysis of low‐input material.
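The duplicate discrimination described above can be illustrated with a minimal sketch. It assumes that each aligned read has already been reduced to a (locus, tag, sequence) tuple, where the locus is the alignment start and the tag is the four-base degenerate sequence read from the adaptor; reads sharing both locus and tag are treated as PCR duplicates of one template molecule. The function names and the tuple layout are hypothetical and chosen for illustration only; they are not the authors' pipeline.

```python
def collapse_duplicates(reads):
    """Keep the first read seen for each (locus, tag) pair.

    reads: iterable of (locus, tag, sequence) tuples (hypothetical layout).
    Reads with the same locus but different tags are kept, because they
    are assumed to come from different template molecules.
    """
    unique = {}
    for locus, tag, sequence in reads:
        # setdefault stores the read only if this (locus, tag) is new
        unique.setdefault((locus, tag), sequence)
    return [(locus, tag, seq) for (locus, tag), seq in unique.items()]


def duplication_rate(reads):
    """Fraction of reads removed as duplicates (0.0 for empty input)."""
    reads = list(reads)
    if not reads:
        return 0.0
    return 1 - len(collapse_duplicates(reads)) / len(reads)


if __name__ == "__main__":
    example = [
        ("contig1:100", "ACGT", "TTAGC"),
        ("contig1:100", "ACGT", "TTAGC"),  # same locus and tag: duplicate
        ("contig1:100", "TTGA", "TTAGC"),  # same locus, new tag: kept
        ("contig2:50", "ACGT", "GGCAT"),
    ]
    print(len(collapse_duplicates(example)))
    print(duplication_rate(example))
```

A four-base tag allows only 4^4 = 256 distinct values, so at deep coverage two independent molecules at the same locus can share a tag by chance; a sketch like this would then slightly overestimate duplication.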
