Regarding the F-word: the effects of data Filtering on inferred genotype-environment associations

Genotype-environment association (GEA) methods have become part of the standard landscape genomics toolkit, yet we know little about how to filter genotyping-by-sequencing data to provide robust inferences of environmental adaptation. In many cases, default filtering thresholds for minor allele frequency and missing data are applied regardless of sample size, with unknown impacts on the results. These effects could be amplified in downstream predictions, including management strategies. Here, we investigate the effects of filtering on GEA results and the potential implications for inferring adaptation to the environment. We use empirical and simulated datasets derived from two widespread tree species to assess the effects of filtering on GEA outputs. Critically, we find that the level of filtering for missing data and minor allele frequency affects the identification of true positives. Even slight adjustments to these thresholds can change the rate of true positive detection. Using conservative thresholds for missing data and minor allele frequency substantially reduces the size of the dataset, lessening the power to detect adaptive variants (i.e. simulated true positives) under both strong and weak selection. Regardless, strength of selection was a good predictor of GEA detection, although even SNPs under strong selection went undetected. We further show that filtering can significantly affect predictions of the adaptive capacity of species in downstream analyses. We make several recommendations regarding filtering for GEA methods. Ultimately, there is no filtering panacea, but some choices are better than others, depending largely on the study system, the availability of genomic resources, and the desired objectives of the study.
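To make the two filters discussed above concrete, the following is a minimal sketch of how a missing-data threshold and a minor allele frequency (MAF) threshold might be applied to a SNP matrix. It is not the pipeline used in the study: the function name `filter_snps`, the 0/1/2 genotype coding with NaN for missing calls, the diploid assumption, and the threshold values are all illustrative assumptions.

```python
import numpy as np

def filter_snps(genotypes, max_missing=0.2, min_maf=0.05):
    """Return a boolean mask of SNPs (columns) that pass both filters.

    Assumed input (hypothetical layout): individuals x SNPs, diploid,
    coded as the count of the alternate allele (0, 1, 2), NaN = missing.
    """
    g = np.asarray(genotypes, dtype=float)
    n_called = (~np.isnan(g)).sum(axis=0)          # called genotypes per SNP
    missing_rate = 1.0 - n_called / g.shape[0]     # fraction of missing calls
    alt_count = np.nansum(g, axis=0)               # alternate alleles observed
    with np.errstate(invalid="ignore", divide="ignore"):
        alt_freq = alt_count / (2.0 * n_called)    # two alleles per individual
    maf = np.minimum(alt_freq, 1.0 - alt_freq)     # NaN if no genotype was called
    return (missing_rate <= max_missing) & (maf >= min_maf)

# Toy data: 5 individuals x 4 SNPs (made up for illustration only).
toy = np.array([
    [0, 0, np.nan, 0],
    [1, 0, np.nan, 0],
    [2, 0, 1,      0],
    [1, 0, 2,      0],
    [0, 1, 0,      0],
])

default_mask = filter_snps(toy)                                   # 20% missing, MAF 0.05
strict_maf_mask = filter_snps(toy, min_maf=0.2)                   # stricter MAF
relaxed_missing_mask = filter_snps(toy, max_missing=0.5)          # more missing data allowed
print(default_mask, strict_maf_mask, relaxed_missing_mask)
```

In this toy matrix the second SNP (MAF 0.1) is kept or dropped depending only on the MAF threshold, and the third SNP (40% missing) depends only on the missing-data threshold, which illustrates how modest threshold changes alter the set of loci that reach a GEA analysis.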
