Controlling for human population stratification in rare variant association studies

Population stratification is a strong confounding factor in human genetic association studies. In analyses of rare variants, the main correction strategies, based on principal components (PC) and linear mixed models (LMM), may yield conflicting conclusions, owing to both the specific type of structure induced by rare variants and the particular statistical features of association tests. Studies evaluating these approaches have generally focused on specific situations with limited types of simulated structure and large sample sizes. We investigated the properties of several correction methods in a large simulation study using real exome data and several within- and between-continent stratification scenarios. We also considered different sample sizes, including situations with as few as 50 cases, to account for the analysis of rare disorders. In this context, we focused on a genetic model in which the phenotype is driven by rare deleterious variants, a setting well suited to a burden test. For analyses of large samples, we found that accounting for stratification was more difficult with a continental structure than with a worldwide structure. LMM failed to maintain a correct type I error in many scenarios, whereas PCs based on common variants failed only in the presence of extreme continental stratification. When a sample of 50 cases was considered, inflation of type I errors was observed with PC for small numbers of controls (≤100) and with LMM for large numbers of controls (≥1000). We also tested a promising novel adapted local permutation method (LocPerm), which maintained a correct type I error in all situations. All approaches capable of properly correcting for stratification had similar power to detect actual associations, indicating that the key issue is the proper control of type I errors. Finally, we found that adding a large panel of external controls (e.g., extracted from publicly available databases) was an efficient way to increase the power of analyses including small numbers of cases, provided an appropriate stratification correction was used.

Author Summary

Genetic association studies focusing on rare variants using next-generation sequencing (NGS) data have become a common strategy to overcome the shortcomings of classical genome-wide association studies for the analysis of rare and common diseases. However, population stratification remains a substantial issue that has not been fully resolved for the analysis of NGS data. In this work, we provide a comprehensive evaluation of the main strategies for accounting for stratification, namely principal components and linear mixed models, along with a novel approach based on local permutations (LocPerm). We compared these correction methods in many different settings, considering several types of population structure, sample sizes and types of variants. Our results highlight important limitations of some classical methods, such as those using principal components (particularly in small samples) and linear mixed models (in several situations). In contrast, LocPerm maintained a correct type I error in all situations. We also showed that adding a large panel of external controls, e.g., from publicly available databases, is an efficient strategy to increase the power of an analysis including a small number of cases, as long as an appropriate stratification correction is used. Our findings provide helpful guidelines for the many researchers working on rare variant association studies.
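
The general idea of combining a burden test with permutations restricted to comparable individuals can be illustrated with a minimal sketch. This is not the LocPerm procedure, whose details are not given here. As a simplifying assumption, phenotypes are shuffled only within predefined ancestry strata, the burden score is a simple carrier indicator, and the test is one-sided (excess of carriers among cases). All function names and the toy data are hypothetical.

```python
import random


def burden_scores(genotypes):
    """Collapse rare-variant genotypes into a carrier indicator per individual."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]


def burden_statistic(scores, phenotypes):
    """Carrier frequency in cases (1) minus carrier frequency in controls (0)."""
    cases = [s for s, y in zip(scores, phenotypes) if y == 1]
    controls = [s for s, y in zip(scores, phenotypes) if y == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def permute_within_strata(phenotypes, strata, rng):
    """Shuffle phenotype labels only among individuals sharing a stratum label."""
    groups = {}
    for idx, label in enumerate(strata):
        groups.setdefault(label, []).append(idx)
    permuted = list(phenotypes)
    for indices in groups.values():
        labels = [phenotypes[i] for i in indices]
        rng.shuffle(labels)
        for i, y in zip(indices, labels):
            permuted[i] = y
    return permuted


def stratified_permutation_pvalue(genotypes, phenotypes, strata,
                                  n_perm=1000, seed=0):
    """One-sided permutation p-value for the burden statistic."""
    rng = random.Random(seed)
    scores = burden_scores(genotypes)
    observed = burden_statistic(scores, phenotypes)
    hits = 0
    for _ in range(n_perm):
        permuted = permute_within_strata(phenotypes, strata, rng)
        if burden_statistic(scores, permuted) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): carrier status is fully confounded with stratum.
toy_genotypes = [[1, 0], [0, 1], [1, 1], [1, 0], [0, 0], [0, 0], [0, 0], [0, 0]]
toy_phenotypes = [1, 1, 1, 0, 1, 0, 0, 0]
toy_strata = ["A", "A", "A", "A", "B", "B", "B", "B"]
toy_observed, toy_pvalue = stratified_permutation_pvalue(
    toy_genotypes, toy_phenotypes, toy_strata, n_perm=200)
```

In this toy example, every carrier belongs to stratum A, so the apparent excess of carriers among cases is explained entirely by stratum membership. Permuting within strata leaves the statistic unchanged in every replicate, and the p-value is 1. An unrestricted permutation would instead treat part of that excess as evidence of association.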
