Imputation Methods to Improve Inference in SNP Association Studies

Missing single nucleotide polymorphisms (SNPs) are quite common in genetic association studies. Subjects with missing SNPs are often discarded in analyses, which may seriously undermine the inference of SNP‐disease association. In this article, we develop two haplotype‐based imputation approaches and one tree‐based imputation approach for association studies. The emphasis is on evaluating the impact of imputation on parameter estimation, compared to the standard practice of ignoring missing data. The haplotype‐based approaches build on haplotype reconstruction by the expectation‐maximization (EM) algorithm or a weighted EM (WEM) algorithm, depending on whether case‐control status is taken into account. The tree‐based approach uses a Gibbs sampler to iteratively sample from a full conditional distribution, which is obtained from the classification and regression tree (CART) algorithm. We employ a standard multiple imputation procedure to account for the uncertainty of imputation. We apply the methods to simulated data as well as to a case‐control study on developmental dyslexia. Our results suggest that imputation generally improves efficiency over the standard practice of ignoring missing data. The tree‐based approach performs comparably to the haplotype‐based approaches and has a computational advantage. The WEM approach yields the smallest bias at the price of increased variance. Genet. Epidemiol. © 2006 Wiley‐Liss, Inc.
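
The abstract does not spell out the "standard multiple imputation procedure". A common reading is that each imputed data set is analyzed separately and the results are pooled with Rubin's combining rules. The sketch below illustrates that pooling step under this assumption; the function name `pool_rubin` and the example numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation results with Rubin's rules (illustrative sketch).

    estimates: point estimates (e.g. log odds ratios), one per imputed data set
    variances: squared standard errors of those estimates, in the same order
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    pooled = mean(estimates)        # average of the m point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance of estimates (divisor m - 1)

    # Total variance inflates the between-imputation part by (1 + 1/m)
    # to account for using a finite number of imputations.
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical numbers for three imputed data sets.
est, var = pool_rubin([0.5, 0.7, 0.6], [0.04, 0.05, 0.06])
```

The between-imputation term is what carries the uncertainty due to imputation: if all imputed data sets gave identical estimates, the total variance would reduce to the average within-imputation variance.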
