A Partially Linear Tree‐based Regression Model for Multivariate Outcomes

In genetic studies of complex traits, especially behavior-related ones such as smoking and alcoholism, several phenotypic measurements are usually obtained to describe the trait, but no single measurement can fully quantify its complicated characteristics because the underlying etiology is poorly understood. If those phenotypes share a common genetic mechanism, analyzing them jointly as a multivariate trait, rather than studying each phenotype separately, can enhance the power to identify associated genes. We propose a multilocus association test for the study of multivariate traits. The test is derived from a partially linear tree-based regression model for multiple outcomes. This novel tree-based model provides a formal statistical testing framework for evaluating the association between a multivariate outcome and a set of candidate predictors, such as markers within a gene or pathway, while accommodating adjustment for other covariates. Through simulation studies we show that the proposed method has an acceptable type I error rate and improved power over univariate outcome analysis, which studies each component of the complex trait separately with a multiple-comparison adjustment. A candidate gene association study of multiple smoking-related phenotypes is used to demonstrate the application and advantages of the new method. The proposed method is general enough to assess the joint effect of a set of risk factors on a multivariate outcome in other biomedical research settings.
