AdmixPipe: population analyses in Admixture for non-model organisms

Background: Research on the molecular ecology of non-model organisms, previously constrained, has been greatly facilitated by reduced-representation sequencing protocols. However, tools for efficiently parsing these large datasets are often lacking or, where available, require a comparable reference genome, which is difficult to obtain for non-model organisms. Pipelines that avoid this prerequisite do exist and allow data to be parsed beforehand. Such pipelines support one widely used molecular ecology program (Structure), yet they are surprisingly absent for a second program that is similarly popular and computationally more efficient (Admixture). The two programs differ in that Admixture employs a maximum-likelihood framework whereas Structure uses a Bayesian approach, yet both produce similar results. Researchers in molecular ecology therefore have a recognized need for bioinformatic software that not only condenses output from replicated Admixture runs but also infers from these data the optimal number of population clusters (K).

Results: Here we provide such a program, AdmixPipe, that (a) filters SNPs to allow the delineation of population structure in Admixture, then (b) parses the output for summarization and graphical representation via Clumpak. Our benchmarks demonstrate the efficiency of the pipeline in processing large, non-model datasets generated via double digest restriction-site associated DNA sequencing (ddRAD). Outputs not only parallel those from Structure but also visualize the variation among individual Admixture runs, facilitating selection of the most appropriate K-value.

Conclusions: AdmixPipe successfully integrates Admixture analysis with popular variant call format (VCF) filtering software to yield file types readily analyzed by Clumpak. Large population genomic datasets derived from non-model organisms are efficiently analyzed via the parallel-processing capabilities of Admixture. AdmixPipe is distributed under the GNU Public License and freely available for Mac OSX and Linux platforms at: https://github.com/stevemussmann/admixturePipeline.
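
To illustrate what condensing replicated Admixture runs across K can involve, the following is a minimal Python sketch, not AdmixPipe's own code. It assumes each run was executed with cross-validation enabled and that the log contains lines of the form `CV error (K=3): 0.52305`. The function names `collect_cv_errors` and `best_k` are hypothetical, and choosing the K with the lowest mean cross-validation error is only one common heuristic; the abstract does not state that AdmixPipe selects K this way.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed log line format, e.g. "CV error (K=3): 0.52305".
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9]*\.?[0-9]+)")


def collect_cv_errors(log_texts):
    """Group cross-validation errors by K across replicate run logs.

    log_texts is an iterable of strings, one per log file (hypothetical input).
    """
    errors = defaultdict(list)
    for text in log_texts:
        for k, value in CV_LINE.findall(text):
            errors[int(k)].append(float(value))
    return dict(errors)


def best_k(errors):
    """Return (K with the lowest mean CV error, mean CV error per K)."""
    means = {k: mean(values) for k, values in errors.items()}
    return min(means, key=means.get), means


if __name__ == "__main__":
    # Made-up log snippets for two replicates at K = 1..3 (illustrative only).
    logs = [
        "CV error (K=1): 0.60\nCV error (K=2): 0.45\nCV error (K=3): 0.50\n",
        "CV error (K=1): 0.62\nCV error (K=2): 0.47\nCV error (K=3): 0.52\n",
    ]
    k, means = best_k(collect_cv_errors(logs))
    for k_value in sorted(means):
        print(f"K={k_value}: mean CV error {means[k_value]:.3f}")
    print(f"Lowest mean CV error at K={k}")
```

Keeping the per-replicate values, rather than only the means, preserves the run-to-run variation that the Results paragraph highlights as useful when choosing K.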
