SNP genotyping and parameter estimation in polyploids using low‐coverage sequencing data

Motivation: Genotyping and parameter estimation using high throughput sequencing data are everyday tasks for population geneticists, but methods developed for diploids are typically not applicable to polyploid taxa. This is due to their duplicated chromosomes, as well as the complex patterns of allelic exchange that often accompany whole genome duplication (WGD) events. For WGDs within a single lineage (autopolyploids), inbreeding can result from mixed mating and/or double reduction. For WGDs that involve hybridization (allopolyploids), alleles are typically inherited through independently segregating subgenomes.

Results: We present two new models for estimating genotypes and population genetic parameters from genotype likelihoods for auto‐ and allopolyploids. We then use simulations to compare these models to existing approaches at varying depths of sequencing coverage and ploidy levels. These simulations show that our models typically have lower levels of estimation error for genotype and parameter estimates, especially when sequencing coverage is low. Finally, we also apply these models to two empirical datasets from the literature. Overall, we show that the use of genotype likelihoods to model non‐standard inheritance patterns is a promising approach for conducting population genomic inferences in polyploids.

Availability and implementation: A C++ program, EBG, is provided to perform inference using the models we describe. It is available under the GNU GPLv3 on GitHub: https://github.com/pblischak/polyploid-genotyping.
