Bias Characterization in Probabilistic Genotype Data and Improved Signal Detection with Multiple Imputation

Missing data are an unavoidable component of modern statistical genetics. Different array or sequencing technologies cover different single nucleotide polymorphisms (SNPs), leading to a complicated mosaic pattern of missingness in which both individual genotypes and entire SNPs are sporadically absent. Such missing data patterns cannot be ignored without introducing bias, yet cannot be inferred exclusively from nonmissing data. In genome-wide association studies, the accepted solution to missingness is to impute missing data using external reference haplotypes. The resulting probabilistic genotypes may be analyzed in place of genotype calls. A general-purpose paradigm, called Multiple Imputation (MI), is known to model uncertainty in many contexts, yet it is not widely used in association studies. Here, we undertake a systematic evaluation of MI and of existing methods for analyzing imputed data. We characterize biases related to uncertainty in association studies, and find that bias is introduced both at the imputation level, when imputation algorithms generate inconsistent genotype probabilities, and at the association level, when analysis methods inadequately model genotype uncertainty. We find that MI performs at least as well as existing methods and in some cases much better, and that it provides a straightforward paradigm for adapting existing genotype association methods to uncertain data.
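As a rough illustration of how MI can be applied to probabilistic genotypes, the sketch below draws several hard-call genotype datasets from per-individual genotype probabilities, fits the same association model to each, and pools the results with Rubin's rules (total variance = within-imputation variance + (1 + 1/m) × between-imputation variance). It is not the authors' implementation: the function names, the additive linear regression used as the association test, and the default of 20 imputations are all assumptions made for illustration.

```python
import random


def sample_genotypes(probs, rng):
    """Draw one hard genotype (0, 1, or 2 alternate alleles) per individual
    from its probability triple (P(0), P(1), P(2))."""
    return [rng.choices((0, 1, 2), weights=p)[0] for p in probs]


def slope_and_variance(x, y):
    """Simple linear regression of y on x.

    Returns (slope, squared standard error of the slope). This stands in for
    whatever genotype association method would normally be used (assumption).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("sampled genotypes are monomorphic; slope undefined")
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    alpha = my - beta * mx
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    return beta, rss / ((n - 2) * sxx)


def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates with Rubin's rules.

    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                               # pooled estimate
    within = sum(variances) / m                              # W
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # B
    return q_bar, within + (1 + 1 / m) * between


def mi_association(probs, phenotype, m=20, seed=0):
    """Hypothetical MI pipeline: sample m datasets, analyze each, pool."""
    rng = random.Random(seed)
    fits = [slope_and_variance(sample_genotypes(probs, rng), phenotype)
            for _ in range(m)]
    betas, variances = zip(*fits)
    return pool_rubin(betas, variances)
```

Because the association step is an ordinary function of hard genotypes, any existing genotype-based test could in principle be substituted for `slope_and_variance` without changing the sampling or pooling steps, which is the sense in which MI adapts existing methods to uncertain data.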
