Kernel Machine Approach to Testing the Significance of Multiple Genetic Markers for Risk Prediction

There is growing evidence that genomic and proteomic research holds great potential for irrevocably changing the practice of medicine. The ability to identify important genomic and biological markers for risk assessment can have a great impact on public health, from disease prevention to detection to treatment selection. However, the potentially large number of markers and the complexity of the relationship between the markers and the outcome of interest pose a major challenge in developing accurate risk prediction models. The standard approach to identifying important markers assesses the marginal effect of each individual marker on a phenotype of interest. When multiple markers relate to the phenotype simultaneously through a complex structure, such marginal analysis may not be effective. To overcome these difficulties, we employ a kernel machine Cox regression framework and propose an efficient score test to assess the overall effect of a set of markers, such as genes within a pathway or a network, on survival outcomes. The proposed test has the advantage of capturing potentially nonlinear effects without explicitly specifying a particular nonlinear functional form. To approximate the null distribution of the score statistic, we propose a simple resampling procedure that can be easily implemented in practice. Numerical studies suggest that the test performs well with respect to both empirical size and power, even when the number of variables in a gene set is not small compared to the sample size.
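As a rough illustration of the kind of test described above, the following Python sketch computes a kernel-based score statistic for censored survival data. It is a sketch under several assumptions, not the procedure proposed here: the function names are hypothetical, the kernel is taken to be Gaussian with a user-chosen scale `rho`, the null model is assumed to contain no other covariates (so the null martingale residuals come from the Nelson-Aalen estimator), and the null distribution is approximated by a plain permutation of the residuals, which stands in for the resampling procedure mentioned in the abstract and is reasonable only if survival and censoring are both unrelated to the markers under the null.

```python
import numpy as np


def gaussian_kernel(Z, rho):
    """Gaussian kernel matrix K[i, j] = exp(-||Z_i - Z_j||^2 / rho)."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    # Clip tiny negative distances caused by floating-point error.
    return np.exp(-np.maximum(d2, 0.0) / rho)


def null_martingale_residuals(time, event):
    """Martingale residuals when no covariates enter the null model.

    Uses the Nelson-Aalen cumulative hazard evaluated at each subject's
    own follow-up time: residual_i = event_i - H(time_i).
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    # Number of subjects still at risk at each subject's follow-up time.
    at_risk = (time[None, :] >= time[:, None]).sum(axis=1)
    # Hazard increment contributed by each observed event.
    increments = event / at_risk
    # Sum the increments of all subjects whose time is <= time_i.
    earlier = time[None, :] <= time[:, None]
    cum_hazard = (earlier * increments[None, :]).sum(axis=1)
    return event - cum_hazard


def kernel_score_test(time, event, Z, rho, n_perm=999, seed=0):
    """Score-type statistic Q = m' K m with a permutation p-value.

    The permutation step is an assumption of this sketch and replaces
    the resampling procedure referred to in the abstract.
    """
    rng = np.random.default_rng(seed)
    K = gaussian_kernel(np.asarray(Z, dtype=float), rho)
    m = null_martingale_residuals(time, event)
    q_obs = m @ K @ m
    q_null = np.empty(n_perm)
    for b in range(n_perm):
        m_perm = rng.permutation(m)
        q_null[b] = m_perm @ K @ m_perm
    p_value = (1 + np.sum(q_null >= q_obs)) / (n_perm + 1)
    return q_obs, p_value
```

Here `time` holds the observed follow-up times, `event` is 1 for an observed failure and 0 for a censored observation, and `Z` is the subjects-by-markers matrix for one gene set. A large value of the statistic relative to its permutation distribution suggests that subjects with similar marker profiles tend to have similar residuals, which is the sort of joint, possibly nonlinear, effect a marginal analysis can miss.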
