Multiple testing in genome-wide association studies via hidden Markov models

MOTIVATION: Genome-wide association studies (GWAS) interrogate common genetic variation across the entire human genome in an unbiased manner and hold promise for identifying genetic variants with moderate or weak effect sizes. However, conventional testing procedures, which are mostly P-value based, ignore the dependency among SNPs and therefore lose efficiency. The goal of this article is to exploit the dependency information among adjacent single nucleotide polymorphisms (SNPs) to improve the screening efficiency in GWAS.

RESULTS: We propose to model the linear block dependency in the SNP data using hidden Markov models (HMMs). A compound decision-theoretic framework for testing HMM-dependent hypotheses is developed. We propose a powerful data-driven procedure, the pooled local index of significance (PLIS), that controls the false discovery rate (FDR) at the nominal level. PLIS is shown to be optimal in the sense that it has the smallest false negative rate (FNR) among all valid FDR procedures. By re-ranking the significance of all SNPs with dependency taken into account, PLIS gains higher power than conventional P-value based methods. Simulation results demonstrate that PLIS dominates conventional FDR procedures in detecting disease-associated SNPs. We apply our method to the SNP data from a GWAS of type 1 diabetes. Compared with the Benjamini-Hochberg (BH) procedure, PLIS yields more accurate results and has better reproducibility of findings.

CONCLUSION: The genomic rankings based on our procedure are substantially different from the rankings based on the P-values. By integrating information from adjacent locations, the PLIS rankings benefit from the increased signal-to-noise ratio; hence our procedure often has higher statistical power and better reproducibility. It provides a promising direction for large-scale GWAS.

AVAILABILITY: An R package, PLIS, has been developed to implement the PLIS procedure. Source code is available upon request and will be available on CRAN (http://cran.r-project.org/).

CONTACT: zhiwei@njit.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
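The abstract does not spell out the decision rule, so the following is only a minimal sketch of how a local-index-of-significance (LIS) ranking can be turned into FDR-controlled rejections. It assumes the LIS value of each SNP (the posterior probability that the SNP is null given all observations on its chromosome) has already been computed from a fitted HMM, for example by a forward-backward pass. It also assumes the usual step-up rule from the LIS literature: pool the LIS values across chromosomes, sort them in increasing order, and reject the largest initial set whose running mean stays at or below the nominal level. The function names `lis_step_up` and `pooled_lis` are hypothetical and are not the API of the R package.

```python
import numpy as np


def lis_step_up(lis, alpha=0.05):
    """Reject the k smallest LIS values, where k is the largest index
    such that the mean of the k smallest values is <= alpha.

    `lis` holds posterior null probabilities (assumed precomputed).
    Returns a boolean array in the original order: True means rejected.
    """
    lis = np.asarray(lis, dtype=float)
    order = np.argsort(lis, kind="stable")
    # Running mean of the sorted LIS values estimates the FDR of
    # rejecting the first k hypotheses; it is non-decreasing in k.
    running_mean = np.cumsum(lis[order]) / np.arange(1, lis.size + 1)
    passing = np.nonzero(running_mean <= alpha)[0]
    reject = np.zeros(lis.size, dtype=bool)
    if passing.size:
        reject[order[: passing[-1] + 1]] = True
    return reject


def pooled_lis(lis_by_chromosome, alpha=0.05):
    """Pool per-chromosome LIS values and apply one global threshold.

    The returned array follows the concatenation order of the input.
    """
    pooled = np.concatenate(
        [np.asarray(chrom, dtype=float) for chrom in lis_by_chromosome]
    )
    return lis_step_up(pooled, alpha)


if __name__ == "__main__":
    # Illustrative, made-up LIS values for two chromosomes.
    chr_a = [0.01, 0.02, 0.90, 0.95]
    chr_b = [0.03, 0.60, 0.04]
    print(pooled_lis([chr_a, chr_b], alpha=0.05))
```

Pooling before thresholding means a SNP is judged against the whole genome rather than against its own chromosome alone, which is the sense in which the ranking is "pooled". The HMM fitting step that produces the LIS values is deliberately left out of this sketch.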
