Test Selection with Application to Detecting Disease Association with Multiple SNPs

We consider the motivating problem of testing for association between a phenotype and multiple single nucleotide polymorphisms (SNPs) within a candidate gene or region. Various statistical approaches have been proposed, including those that combine univariate (single-locus) analyses and those based on multivariate (multilocus) analyses. In theory, however, no uniformly most powerful test exists for detecting association with multiple SNPs. Several tests have been shown to be frequent winners across a range of practical situations, but which of them is most powerful changes with the situation in an unknown way. Here we propose a novel test selection procedure to select from five such tests: a so-called UminP test, which combines multiple univariate (single-locus) score tests by taking the minimum of their p values as its test statistic; a multivariate score test and its two modifications; and a so-called sum test. We also illustrate its application to selecting genotype codings for the sum test, since the performance of the sum test depends on its genotype coding in an unknown way. Our major contributions are a methodology for estimating the power of a given test with a given dataset and the idea of using the estimated power as the criterion for test selection. We also propose a fast simulation-based method to calculate p values for the test selection procedure and for any method of combining p values. Our numerical results indicated that the proposed test selection procedure always yielded power close to that of the most powerful candidate test in any given situation; in particular, it performed better than or as well as the popular combining method of taking the minimum p value of the candidate tests.
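
As a rough illustration of how a minimum-p-value statistic such as UminP can be calibrated by simulation, the sketch below draws correlated standard normal score statistics under the null and compares the largest absolute simulated value with the observed one (taking the smallest two-sided p value is equivalent to taking the largest absolute z score). This is a minimal sketch under stated assumptions, not necessarily the procedure used in the paper: the names `cholesky` and `uminp_pvalue`, the inputs `z` (standardized single-locus score statistics) and `corr` (their null correlation matrix), and the multivariate normal null are all assumptions made for illustration.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a small positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def uminp_pvalue(z, corr, n_sim=20000, seed=0):
    """Monte Carlo p value for the min-p (equivalently max-|z|) statistic.

    Assumes (hypothetically) that, under the null, the standardized score
    statistics are jointly normal with mean zero and correlation `corr`.
    """
    rng = random.Random(seed)
    low = cholesky(corr)
    m = len(z)
    observed = max(abs(v) for v in z)
    exceed = 0
    for _ in range(n_sim):
        # Independent N(0, 1) draws, then correlate them via the Cholesky factor.
        e = [rng.gauss(0.0, 1.0) for _ in range(m)]
        null_max = max(
            abs(sum(low[i][k] * e[k] for k in range(i + 1))) for i in range(m)
        )
        if null_max >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_sim + 1)
```

For example, `uminp_pvalue([1.96, 0.0], [[1.0, 0.0], [0.0, 1.0]])` should land near 1 - 0.95^2 ≈ 0.0975 for two independent SNPs, and the same observed statistics with strongly correlated SNPs give a smaller p value because the effective number of tests is lower.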
