Correlated Component Regression: A Prediction/Classification Methodology for Possibly Many Features

A new ensemble dimension reduction regression technique, called Correlated Component Regression (CCR), is proposed that predicts the dependent variable based on K correlated components. For K = 1, CCR is equivalent to the corresponding Naive Bayes solution, and for K = P, CCR is equivalent to traditional regression with P predictors. An optional step-down variable selection procedure provides a sparse solution, with each component defined as a linear combination of only P* < P predictors. For high-dimensional data, simulation results suggest that good prediction is generally attainable with K = 3 or 4 regardless of the number of predictors, and estimation is fast. When the predictors include one or more suppressor variables, as is common with gene expression data, simulations based on linear regression, logistic regression, and discriminant analysis suggest that CCR predicts better out of sample than comparable approaches based on stepwise regression, penalized regression, and/or PLS regression. A major reason for the improvement is that the CCR/step-down algorithm is much better than other sparse techniques at retaining important suppressor variables among the final predictors.
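The component-building idea described above can be sketched in code. The following is a minimal, hypothetical illustration for the linear-regression case only, assuming (beyond what the abstract states) that component 1 is built from the one-predictor "Naive Bayes" coefficients and that each subsequent component uses each predictor's partial coefficient after controlling for the components already formed; the function name and all implementation details are illustrative, not the authors' algorithm.

```python
import numpy as np

def ccr_fit(X, y, K=3):
    """Sketch of Correlated Component Regression for linear regression.

    Each component S_k is a linear combination of all P predictors;
    the outcome is then regressed on the K components. Assumed detail:
    the loading of predictor g on component k is its coefficient when
    y is regressed on (S_1, ..., S_{k-1}, x_g), averaged over g.
    """
    n, P = X.shape
    Xc = X - X.mean(axis=0)          # center predictors
    yc = y - y.mean()                # center outcome
    S = np.empty((n, 0))             # components built so far
    loadings = []
    for _ in range(K):
        b = np.empty(P)
        for g in range(P):
            # regress y on the existing components plus x_g alone;
            # keep only the coefficient attached to x_g
            Z = np.column_stack([S, Xc[:, g]])
            coef, *_ = np.linalg.lstsq(Z, yc, rcond=None)
            b[g] = coef[-1]
        loadings.append(b / P)       # average the per-predictor coefficients
        S = np.column_stack([S, Xc @ loadings[-1]])
    # final regression of y on the K components, folded back to
    # the predictor scale so the model reads y = b0 + X @ beta
    a, *_ = np.linalg.lstsq(S, yc, rcond=None)
    beta = np.column_stack(loadings) @ a
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta
```

With K equal to the number of predictors and generic full-rank data, the K components span the predictor space, so this sketch reproduces the ordinary least-squares fit, consistent with the K = P equivalence claimed in the abstract; the step-down variable selection (dropping predictors with small standardized loadings) is omitted for brevity.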
