The Role of Proxy Genes in Predictive Models: An Application to Early Detection of Prostate Cancer

The most important predictor in a regression model may be a suppressor variable, which does not predict the outcome variable directly but improves the overall prediction by enhancing the effects of other predictors in the model. The most important gene in a 9-gene model for early detection of prostate cancer is SP1, whose mean does not differ significantly between cancer subjects and medically defined normal subjects. We suggest that SP1 predicts the pre-cancer expression of genes that it regulates, including some that do have direct effects. We refer to such suppressor variables as ‘proxy genes’, and to their associated genes that have direct effects as ‘prime genes’. We introduce the basic ideas of proxy genes using a simple 2-gene prime/proxy model, and then present the 9-gene + PSA model developed by Correlated Component Regression (CCR). CCR is a structured approach for developing a reliable predictive model based on prime and proxy genes selected from a potentially large pool of candidate genes (Magidson, 2010a, 2010b). Simulation results suggest that when one or more suppressor variables are among the potential predictors, CCR outperforms alternative methods for analyzing high-dimensional data, such as popular penalty approaches and PLS regression.
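
The suppressor effect described above can be illustrated with a small simulation. The sketch below is not CCR and does not use the paper's data: it is a minimal 2-predictor example fitted by ordinary least squares on synthetic values, with a continuous outcome standing in for cancer status to keep the arithmetic simple. The variable names (`signal`, `nuisance`, `prime`, `proxy`) and all numeric settings are assumptions chosen for illustration. The prime predictor carries outcome-related signal mixed with unrelated variation; the proxy predictor tracks only that unrelated variation, so it is essentially uncorrelated with the outcome yet improves the fit when added.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical latent components (illustrative assumptions, not real data).
signal = rng.normal(size=n)      # variation related to the outcome
nuisance = rng.normal(size=n)    # variation unrelated to the outcome

prime = signal + nuisance                     # direct effect, but contaminated
proxy = nuisance + 0.3 * rng.normal(size=n)   # tracks only the contamination
y = signal + 0.5 * rng.normal(size=n)         # continuous stand-in outcome


def ols(X, y):
    """Fit least squares with an intercept; return (coefficients, R^2)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return beta, 1.0 - resid.var() / y.var()


corr_proxy_y = np.corrcoef(proxy, y)[0, 1]
_, r2_prime = ols(prime, y)
beta_both, r2_both = ols(np.column_stack([prime, proxy]), y)

print(f"corr(proxy, y)       = {corr_proxy_y:.3f}")
print(f"R^2, prime only      = {r2_prime:.3f}")
print(f"R^2, prime + proxy   = {r2_both:.3f}")
print(f"coefficient on proxy = {beta_both[2]:.3f}")
```

In this setup the proxy receives a negative coefficient: it subtracts the unrelated variation from the prime predictor, which is how a variable with no direct relationship to the outcome can still sharpen the overall prediction.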