assignPOP: An R package for population assignment using genetic, non‐genetic, or integrated data in a machine‐learning framework

Summary

1. The use of biomarkers (e.g., genetic, microchemical, and morphometric characteristics) to discriminate among populations and assign individuals to them can benefit species conservation and management by improving our understanding of population structure and demography.

2. Tools that can evaluate the reliability of large genomic datasets for population discrimination and assignment, and that allow these datasets to be integrated with non-genetic markers for the same purpose, are lacking. Our R package, assignPOP, provides both functions in a supervised machine-learning framework.

3. assignPOP uses Monte-Carlo and K-fold cross-validation procedures, together with principal component analysis (PCA), to estimate assignment accuracy and membership probabilities from independent training (i.e., baseline source population) and test (i.e., validation) datasets. A user can then build a predictive model by specifying the relative sizes of these datasets and a classification function: linear discriminant analysis, support vector machine, naive Bayes, decision tree, or random forest.

4. assignPOP can benefit any researcher who seeks to use genetic or non-genetic data to infer population structure and the membership of individuals. assignPOP is a freely available R package under the GPL license, and can be downloaded from CRAN or from https://github.com/alexkychen/assignPOP. A comprehensive tutorial is available at https://alexkychen.github.io/assignPOP/.
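The two resampling schemes named in the summary can be illustrated with a small sketch. The code below is not assignPOP's implementation and does not use its API: it is plain Python, all function names are hypothetical, the toy data are made up for illustration, and a simple nearest-centroid classifier stands in for the classifiers the package offers. The PCA step is omitted. It only shows how Monte-Carlo cross-validation (repeated random training/test splits within each population) differs from K-fold cross-validation (each individual is tested exactly once).

```python
import random
from statistics import mean

def fit_centroids(X, y):
    """Stand-in classifier (assumption): one mean vector per population."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {lab: [mean(col) for col in zip(*rows)] for lab, rows in groups.items()}

def predict(centroids, row):
    """Assign a row to the population with the closest centroid."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda lab: sqdist(centroids[lab]))

def accuracy(X, y, train_idx, test_idx):
    """Fit on the training individuals only, then score the held-out ones."""
    model = fit_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
    hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
    return hits / len(test_idx)

def indices_by_population(y):
    by_pop = {}
    for i, label in enumerate(y):
        by_pop.setdefault(label, []).append(i)
    return by_pop

def monte_carlo_cv(X, y, train_fraction=0.7, iterations=20, seed=0):
    """Repeated random splits; each population needs at least 2 individuals."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        train, test = [], []
        for idx in indices_by_population(y).values():
            shuffled = rng.sample(idx, len(idx))
            # Keep at least one individual on each side of the split.
            cut = max(1, min(len(idx) - 1, round(train_fraction * len(idx))))
            train += shuffled[:cut]
            test += shuffled[cut:]
        scores.append(accuracy(X, y, train, test))
    return scores

def k_fold_cv(X, y, k=5, seed=0):
    """K folds built within each population; assumes k <= smallest population size."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idx in indices_by_population(y).values():
        shuffled = rng.sample(idx, len(idx))
        for pos, i in enumerate(shuffled):
            folds[pos % k].append(i)
    scores = []
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        scores.append(accuracy(X, y, train, folds[f]))
    return scores

# Toy data (made up): two well-separated populations, 20 individuals each.
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(20)]
X += [[10 + data_rng.uniform(-1, 1), 10 + data_rng.uniform(-1, 1)] for _ in range(20)]
y = ["popA"] * 20 + ["popB"] * 20

mc_scores = monte_carlo_cv(X, y, train_fraction=0.7, iterations=20)
kf_scores = k_fold_cv(X, y, k=5)
print("Monte-Carlo mean accuracy:", mean(mc_scores))
print("K-fold mean accuracy:", mean(kf_scores))
```

In both schemes the test individuals never contribute to the fitted model, which is the independence between training and test data that the summary emphasizes. Varying `train_fraction` in the Monte-Carlo scheme is one way to see how accuracy depends on the relative sizes of the two datasets.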
