ddradseqtools: a software package for in silico simulation and testing of double‐digest RADseq experiments

Double‐digested RADseq (ddRADseq) is an NGS methodology that generates reads from thousands of loci targeted by restriction enzyme cut sites, across multiple individuals. To be statistically sound and economically optimal, a ddRADseq experiment needs a preliminary design stage that considers the selection of enzymes, particular features of the genome of the focal species, possible modifications to the library construction protocol, the coverage needed to minimize missing data, and the potential sources of error that may affect that coverage. We present ddradseqtools, a software package that supports ddRADseq experimental design through (i) the generation of in silico double‐digested fragments; (ii) the construction of modified ddRADseq libraries using adapters with either one or two indexes and degenerate base regions (DBRs) to quantify PCR duplicates; and (iii) the initial steps of the bioinformatics preprocessing of reads. ddradseqtools generates single‐end (SE) or paired‐end (PE) reads that may bear SNPs and/or indels. It also simulates the effect of allele dropout and PCR duplicates on coverage. The resulting output files can be submitted to alignment and variant calling pipelines to allow the fine‐tuning of their parameters. The software was validated with specific tests of correct program operation. The correspondence between in silico settings and parameters from in vitro ddRADseq experiments was assessed to provide guidelines for the reliable performance of the software. ddradseqtools is efficient in terms of execution time and can be run on computers with standard CPU and RAM configurations.
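As an illustration of what an in silico double digest involves, the following Python sketch cuts a sequence with two enzymes and keeps only fragments flanked by one cut from each enzyme, optionally filtered by size. It is not the ddradseqtools implementation: the function name `double_digest`, its parameters, and the toy sequence are hypothetical, and the sketch considers only the top strand. The enzymes used are EcoRI (G^AATTC) and MseI (T^TAA).

```python
import re


def double_digest(seq, site_a, cut_a, site_b, cut_b, min_len=0, max_len=None):
    """Return fragments bounded by one cut from each of two enzymes.

    site_a/site_b are recognition sequences; cut_a/cut_b are the offsets
    of the cut within each site (top strand only, a simplification).
    All names here are hypothetical and chosen for illustration.
    """
    seq = seq.upper()
    cuts = []
    for label, site, offset in (("A", site_a, cut_a), ("B", site_b, cut_b)):
        # A lookahead pattern also finds overlapping recognition sites.
        for match in re.finditer("(?=%s)" % re.escape(site), seq):
            cuts.append((match.start() + offset, label))
    cuts.sort()

    fragments = []
    # Adjacent cut positions delimit the fragments of a complete digest.
    for (start, enz1), (end, enz2) in zip(cuts, cuts[1:]):
        if enz1 == enz2 or end <= start:
            continue  # keep only fragments with two different enzyme ends
        fragment = seq[start:end]
        if len(fragment) < min_len:
            continue
        if max_len is not None and len(fragment) > max_len:
            continue
        fragments.append(fragment)
    return fragments


# Toy sequence containing EcoRI, MseI, and EcoRI sites in that order.
toy = "CCCCGAATTCACGTACGTTTAAGGGGGAATTCCC"
fragments = double_digest(toy, "GAATTC", 1, "TTAA", 1)
selected = double_digest(toy, "GAATTC", 1, "TTAA", 1, min_len=10)
print(fragments)
print(selected)
```

The `min_len` and `max_len` arguments stand in for the size-selection step of a ddRADseq protocol; counting the fragments that survive this filter for a given enzyme pair is the kind of estimate a design stage would compare across candidate enzymes.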
