Deriving genotypes from RAD-seq short-read data using Stacks

Restriction site-associated DNA sequencing (RAD-seq) allows for the genome-wide discovery and genotyping of single-nucleotide polymorphisms in hundreds of individuals at a time in model and nonmodel species alike. However, converting short-read sequencing data into reliable genotype data remains a nontrivial task, especially as RAD-seq is used in systems that have very diverse genomic properties. Here, we present a protocol to analyze RAD-seq data using the Stacks pipeline. This protocol will be of use in areas such as ecology and population genetics. It covers the assessment and demultiplexing of the sequencing data, read mapping, inference of RAD loci, genotype calling, and filtering of the output data, and it provides two simple examples of downstream biological analyses. We place special emphasis on checking the soundness of the procedure and choosing the main parameters, given the properties of the data. The procedure can be completed in 1 week, but settling on definitive methodological choices will typically take up to 1 month.
