Proper analysis of secondary phenotype data in case‐control association studies

Case‐control association studies often collect extensive information on secondary phenotypes, which are quantitative or qualitative traits other than the case‐control status. Exploring secondary phenotypes can yield valuable insights into biological pathways and identify genetic variants influencing phenotypes of direct interest. All publications on secondary phenotypes have used standard statistical methods, such as least‐squares regression for quantitative traits. Because cases and controls are selected with unequal probabilities, however, the case‐control sample is not a random sample from the general population, and standard statistical analysis of secondary phenotype data can be extremely misleading. One may try to avoid the sampling bias by analyzing cases and controls separately or by including the case‐control status as a covariate in the model, but the associations between a secondary phenotype and a genetic variant within the case and control groups can be quite different from the association in the general population. In this article, we present novel statistical methods that properly reflect the case‐control sampling in the analysis of secondary phenotype data. The new methods provide unbiased estimation of genetic effects and accurate control of false‐positive rates while maximizing statistical power. We demonstrate the pitfalls of the standard methods and the advantages of the new methods both analytically and numerically. The relevant software is available at our website. Genet. Epidemiol. 2009. © 2008 Wiley‐Liss, Inc.
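The sampling bias described in the abstract can be illustrated with a small simulation. The sketch below is not the method proposed in the article. It assumes a hypothetical population in which a genetic variant has no effect on a quantitative secondary phenotype, while both the variant and the phenotype raise disease risk. It then compares naive least-squares regression on a pooled case-control sample with inverse-probability weighting, a standard survey-sampling correction swapped in here for illustration. All parameter values and names (`weighted_slope`, `n_per_group`, the effect sizes) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2009)

# Hypothetical population; every numeric value here is an illustrative assumption.
N = 1_000_000
g = rng.binomial(2, 0.3, size=N).astype(float)  # minor-allele count, allele frequency 0.3
y = rng.normal(size=N)                          # secondary phenotype; true effect of g on y is 0
# Disease risk depends on both the genotype and the secondary phenotype.
logit = -4.0 + np.log(2.0) * g + np.log(3.0) * y
d = rng.random(N) < 1.0 / (1.0 + np.exp(-logit))

# Case-control sampling: equal numbers of cases and controls, so cases are
# heavily over-represented relative to the population.
n_per_group = 20_000
case_idx = rng.choice(np.flatnonzero(d), size=n_per_group, replace=False)
ctrl_idx = rng.choice(np.flatnonzero(~d), size=n_per_group, replace=False)
idx = np.concatenate([case_idx, ctrl_idx])
gs, ys, ds = g[idx], y[idx], d[idx]


def weighted_slope(x, y, w):
    """Slope of the weighted least-squares regression of y on x (with intercept)."""
    xbar = np.average(x, weights=w)
    ybar = np.average(y, weights=w)
    return np.sum(w * (x - xbar) * (y - ybar)) / np.sum(w * (x - xbar) ** 2)


# Naive analysis: ordinary least squares on the pooled case-control sample.
naive_slope = weighted_slope(gs, ys, np.ones_like(ys))

# Inverse-probability weighting: each subject is weighted by the inverse of the
# sampling fraction of its group (assumes the population counts are known).
w = np.where(ds, d.sum() / n_per_group, (~d).sum() / n_per_group)
ipw_slope = weighted_slope(gs, ys, w)

# Reference: the same regression in the full population.
population_slope = weighted_slope(g, y, np.ones_like(y))

print(f"population slope: {population_slope:.3f}")
print(f"naive slope:      {naive_slope:.3f}")
print(f"IPW slope:        {ipw_slope:.3f}")
```

Under these assumptions the naive slope is pulled away from zero because cases carry both more risk alleles and higher phenotype values, whereas the weighted estimate stays close to the population value. Weighting is simple but generally not the most efficient option, which is consistent with the abstract's emphasis on methods that also maximize statistical power.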
