Applications of random forest feature selection for fine‐scale genetic population assignment

Genetic population assignment used to inform wildlife management and conservation efforts requires panels of highly informative genetic markers and sensitive assignment tests. We explored the utility of machine‐learning algorithms (random forest, regularized random forest and guided regularized random forest) compared with FST ranking for selection of single nucleotide polymorphisms (SNPs) for fine‐scale population assignment. We applied these methods to an unpublished SNP data set for Atlantic salmon (Salmo salar) and a published SNP data set for Alaskan Chinook salmon (Oncorhynchus tshawytscha). In each species, we used each method to create panels of 50–700 markers and identified the minimum panel size required to obtain a self‐assignment accuracy of at least 90%. Panels of SNPs identified using random forest‐based methods performed up to 7.8 and 11.2 percentage points better than FST‐selected panels of similar size for the Atlantic salmon and Chinook salmon data, respectively. Self‐assignment accuracy ≥90% was obtained with panels of 670 and 384 SNPs for the Atlantic salmon and Chinook salmon data sets, respectively, a level of accuracy never reached for these species using FST‐selected panels. Our results demonstrate a role for machine‐learning approaches in marker selection across large genomic data sets to improve assignment for management and conservation of exploited populations.
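
To make the FST-ranking baseline concrete, the following is a minimal sketch of how SNPs might be ranked by differentiation and the top markers kept as a panel. It is an illustration under stated assumptions, not the study's pipeline: it uses the simple heterozygosity-based form FST = (HT − HS) / HT with equal population weights rather than any particular published estimator, and the function names, locus names and allele frequencies are all hypothetical.

```python
from typing import Dict, List


def locus_fst(pop_freqs: List[float]) -> float:
    """Heterozygosity-based FST for one biallelic locus.

    pop_freqs holds the frequency of one allele in each population.
    Assumption: populations are weighted equally and no sample-size
    correction is applied.
    """
    p_bar = sum(pop_freqs) / len(pop_freqs)
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0  # monomorphic locus carries no information
    # mean expected heterozygosity within populations
    h_within = sum(2 * p * (1 - p) for p in pop_freqs) / len(pop_freqs)
    return (h_total - h_within) / h_total


def rank_by_fst(freqs: Dict[str, List[float]], panel_size: int) -> List[str]:
    """Return the panel_size loci with the highest FST, best first."""
    ranked = sorted(freqs, key=lambda locus: locus_fst(freqs[locus]), reverse=True)
    return ranked[:panel_size]


# Hypothetical allele frequencies for three SNPs in two populations.
toy_freqs = {
    "snp_a": [0.5, 0.5],  # identical frequencies: no differentiation
    "snp_b": [0.1, 0.9],  # strongly differentiated
    "snp_c": [0.4, 0.6],  # weakly differentiated
}

panel = rank_by_fst(toy_freqs, panel_size=2)
print(panel)
```

Ranking of this kind scores each locus independently, which is one reason a multivariate method such as random forest could select a different, and potentially more informative, set of markers for the same panel size.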
