Two‐sample Comparison Based on Prediction Error, with Applications to Candidate Gene Association Studies

To take advantage of the increasingly available high‐density SNP maps across the genome, various tests that compare multilocus genotypes or estimated haplotypes between cases and controls have been developed for candidate gene association studies. Here we view this two‐sample testing problem from the perspective of supervised machine learning and propose a new association test. The approach adopts the flexible and easy‐to‐understand classification tree model as the learning machine, and uses the estimated prediction error of the resulting prediction rule as the test statistic. This procedure not only provides an association test but also generates a prediction rule that can be useful in understanding the mechanisms underlying complex disease. In the setting of a haplotype‐based transmission/disequilibrium test (TDT) type of analysis, we find through simulation studies that the proposed procedure maintains the correct type I error rate and is robust to population stratification. The power of the proposed procedure is sensitive to the choice of prediction error estimator. Among commonly used prediction error estimators, the .632+ estimator yields the test with the best overall performance. We also find that the test using the .632+ estimator is more powerful than the standard single‐point TDT analysis, Pearson's goodness‐of‐fit test based on estimated haplotype frequencies, and two haplotype‐based global tests implemented in the genetic analysis package FBAT. To illustrate the application of the proposed method in population‐based association studies, we use the procedure to study the association between non‐Hodgkin lymphoma and the IL10 gene.
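The idea of using an estimated prediction error as a two‐sample test statistic can be sketched roughly as follows. This is a minimal illustration under several assumptions that are not taken from the abstract: a depth‐1 tree (a stump) stands in for a full classification tree, genotypes are coded 0/1/2 for unrelated cases and controls rather than the TDT‐type setting, a simplified form of the .632+ estimator is used, and the null distribution is obtained by permuting case/control labels. The function names `fit_stump`, `err632plus`, and `permutation_pvalue` are hypothetical.

```python
import random


def fit_stump(X, y):
    """Fit a depth-1 classification tree on genotype codes 0/1/2 (assumed stand-in learner)."""
    best = None
    for j in range(len(X[0])):
        for t in (0, 1):          # split: genotype <= t versus genotype > t
            for hi in (0, 1):     # class predicted when genotype > t
                err = sum((hi if row[j] > t else 1 - hi) != lab
                          for row, lab in zip(X, y))
                if best is None or err < best[0]:
                    best = (err, j, t, hi)
    _, j, t, hi = best
    return lambda row: hi if row[j] > t else 1 - hi


def err632plus(X, y, n_boot=20, rng=None):
    """Simplified .632+ estimate of the misclassification rate for binary labels."""
    if rng is None:
        rng = random.Random(0)
    n = len(y)
    rule = fit_stump(X, y)
    preds = [rule(row) for row in X]
    err_bar = sum(p != lab for p, lab in zip(preds, y)) / n  # resubstitution error

    # Leave-one-out bootstrap error: score each subject only with rules
    # fitted on bootstrap samples that do not contain that subject.
    miss, seen = [0] * n, [0] * n
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        boot_rule = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        for i in range(n):
            if i not in in_bag:
                seen[i] += 1
                miss[i] += boot_rule(X[i]) != y[i]
    per_obs = [m / s for m, s in zip(miss, seen) if s > 0]
    err1 = sum(per_obs) / len(per_obs) if per_obs else err_bar

    # No-information error rate and relative overfitting rate.
    p1, q1 = sum(y) / n, sum(preds) / n
    gamma = p1 * (1 - q1) + (1 - p1) * q1
    err1 = min(err1, gamma)
    if err1 > err_bar and gamma > err_bar:
        rel_overfit = (err1 - err_bar) / (gamma - err_bar)
    else:
        rel_overfit = 0.0
    w = 0.632 / (1 - 0.368 * rel_overfit)
    return (1 - w) * err_bar + w * err1


def permutation_pvalue(X, y, n_perm=49, seed=1):
    """Permutation p-value: a small prediction error is evidence of association."""
    rng = random.Random(seed)
    observed = err632plus(X, y, rng=rng)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)       # break any genotype/phenotype link
        if err632plus(X, y_perm, rng=rng) <= observed:
            hits += 1
    return observed, (1 + hits) / (1 + n_perm)


if __name__ == "__main__":
    # Toy data (fabricated for illustration): SNP 0 separates cases from controls.
    y = [i % 2 for i in range(40)]
    X = [[y[i] * (1 + (i // 2) % 2), i % 3, (i // 4) % 3] for i in range(40)]
    print(permutation_pvalue(X, y))
```

The test is one‐sided by construction: only prediction errors at least as small as the observed one count against the null hypothesis. In practice the stump would be replaced by a pruned classification tree and the number of bootstrap samples and permutations would be much larger than in this toy setting.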
