Stratification‐Score Matching Improves Correction for Confounding by Population Stratification in Case‐Control Association Studies

Proper control of confounding due to population stratification is crucial for valid analysis of case‐control association studies. Fine matching of cases and controls based on genetic ancestry is an increasingly popular strategy to correct for such confounding, both in genome‐wide association studies (GWASs) and in studies that employ next‐generation sequencing, where matching can be used to select a subset of GWAS participants for rare‐variant analysis. Existing matching methods match on measures of genetic ancestry that combine multiple components of ancestry into a scalar quantity. However, we show that including nonconfounding ancestry components in a matching criterion can lead to inaccurate matches, and hence to improper control of confounding. To resolve this issue, we propose a novel method that assigns cases and controls to matched strata based on the stratification score (Epstein et al. [2007] Am J Hum Genet 80:921–930), which is the probability of disease given genomic variables. Matching on the stratification score leads to more accurate matches because case participants are matched to control participants who have a similar risk of disease given ancestry information. We illustrate our matching method using the African‐American arm of the GAIN GWAS of schizophrenia. In this study, we observe that confounding due to stratification can be resolved by our matching approach but not by other existing matching procedures. We also use simulated data to show that our matching approach can provide a more appropriate correction for population stratification than existing matching approaches.
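
As a rough illustration of the idea, the sketch below estimates a stratification score, that is, the probability of disease given ancestry variables, and groups participants into strata with similar scores. It is not the authors' implementation. It assumes simulated data, two hypothetical ancestry components of which only the first affects disease risk, a plain logistic regression as the score model, and quantile subclassification as a simple stand-in for the matching procedure described in the paper.

```python
import numpy as np


def stratification_score(ancestry, disease, n_iter=25, ridge=1e-6):
    """Estimate P(disease | ancestry) by logistic regression (Newton's method).

    The tiny ridge term is only a numerical stabilizer (an assumption of this
    sketch), not part of the stratification-score definition.
    """
    X = np.column_stack([np.ones(len(ancestry)), ancestry])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (disease - p) - ridge * beta
        hess = (X * w[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-X @ beta))


def assign_strata(score, n_strata=5):
    """Place participants into strata defined by quantiles of the score."""
    cuts = np.quantile(score, np.linspace(0.0, 1.0, n_strata + 1)[1:-1])
    return np.searchsorted(cuts, score, side="right")


# Simulated example (all values hypothetical): two ancestry components,
# but only the first one changes disease risk, so only it confounds.
rng = np.random.default_rng(0)
n = 2000
ancestry = rng.normal(size=(n, 2))
risk = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * ancestry[:, 0])))
disease = rng.binomial(1, risk)

score = stratification_score(ancestry, disease)
strata = assign_strata(score, n_strata=5)

# Cases and controls in the same stratum have similar estimated disease risk.
for s in range(5):
    in_s = strata == s
    print(s, int(disease[in_s].sum()), int((1 - disease[in_s]).sum()))
```

Because the score is driven by how strongly each ancestry component predicts disease, a component that does not affect risk contributes little to the grouping, which is the intuition behind matching on the score rather than on raw ancestry distance.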
