Bioinformatic processing of RAD‐seq data dramatically impacts downstream population genetic inference

1. Restriction site-associated DNA sequencing (RAD-seq) provides high-resolution population genomic data at low cost and has become an important component of ecological and evolutionary studies. As with all high-throughput technologies, analytic strategies require critical validation to ensure precise and unbiased interpretation.

2. To test the impact of bioinformatic data processing on downstream population genetic inferences, we analysed mammalian RAD-seq data (>100 individuals) with 312 combinations of methodology (de novo vs. mapping to references of increasing divergence) and filtering criteria (missing data, HWE, F_IS, coverage, mapping quality and genotype quality). To identify commonalities and biases across all pipelines, we computed summary statistics (number of loci, number of SNPs, π, observed heterozygosity H_obs, F_IS, F_ST, N_e and m) and compared the results to independent null expectations (isolation-by-distance correlation, expected transition-to-transversion ratio Ts/Tv, and Mendelian mismatch rates of known parent-offspring trios).

3. We observed large differences between reference-based and de novo approaches, the former generally calling more SNPs and reducing F_IS and Ts/Tv. Data completion levels showed little impact on most summary statistics, and F_ST estimates were robust across all pipelines. The site frequency spectrum was highly sensitive to the chosen approach, as reflected in the large variance of parameter estimates across demographic scenarios (single-population bottlenecks and an isolation-with-migration model). Null expectations were best met by reference-based approaches, although this was contingent on the specific filtering criteria.

4. We recommend that RAD-seq studies employ reference-based approaches using the genome of a closely related species and, given the high stochasticity associated with the pipeline, advocate the use of multiple pipelines to ensure robust population genetic and demographic inferences.
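Two of the null expectations named above, the transition-to-transversion ratio and the Mendelian mismatch rate in parent-offspring trios, can be illustrated with a minimal Python sketch. This is not the authors' implementation; the function names and the input layout (alleles as single-letter strings, genotypes as 2-tuples, missing calls as `None`) are assumptions made purely for illustration.

```python
from itertools import product

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True when both alleles are purines or both are pyrimidines.

    Assumes ref and alt are different single bases (a true biallelic SNP).
    """
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Ts/Tv for an iterable of (ref, alt) single-base pairs."""
    ts = tv = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("inf")


def is_mendelian_consistent(mother, father, child):
    """The child must carry one allele from each parent (diploid, autosomal)."""
    possible = {tuple(sorted(p)) for p in product(mother, father)}
    return tuple(sorted(child)) in possible


def mendelian_mismatch_rate(trios):
    """Fraction of fully genotyped trio sites that violate Mendelian inheritance.

    trios: iterable of (mother, father, child) genotypes at one site each;
    sites where any genotype is None (missing call) are skipped.
    """
    checked = mismatches = 0
    for mother, father, child in trios:
        if None in (mother, father, child):
            continue
        checked += 1
        if not is_mendelian_consistent(mother, father, child):
            mismatches += 1
    return mismatches / checked if checked else float("nan")
```

Under these assumptions, a pipeline whose calls give a Ts/Tv far from the value expected for the genome, or a high mismatch rate in known trios, would be suspected of containing more genotyping or assembly artefacts.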
