Sparse Partial Least Squares Classification for High Dimensional Data

Partial least squares (PLS) is a well known dimension reduction method which has been recently adapted for high dimensional classification problems in genome biology. We develop sparse versions of the recently proposed two PLS-based classification methods using sparse partial least squares (SPLS). These sparse versions aim to achieve variable selection and dimension reduction simultaneously. We consider both binary and multicategory classification. We provide analytical and simulation-based insights about the variable selection properties of these approaches and benchmark them on well known publicly available datasets that involve tumor classification with high dimensional gene expression data. We show that incorporation of SPLS into a generalized linear model (GLM) framework provides higher sensitivity in variable selection for multicategory classification with unbalanced sample sizes between classes. As the sample size increases, the two-stage approach provides comparable sensitivity with better specificity in variable selection. In binary classification and multicategory classification with balanced sample sizes, the two-stage approach provides comparable variable selection and prediction accuracy as the GLM version and is computationally more efficient.

[1]  A. Albert,et al.  On the existence of maximum likelihood estimates in logistic regression models , 1984 .

[2]  J. Friedman,et al.  A Statistical View of Some Chemometrics Regression Tools , 1993 .

[3]  Gersende Fort,et al.  Classification using partial least squares with penalized logistic regression , 2005, Bioinform..

[4]  S. Keleş,et al.  Sparse partial least squares regression for simultaneous dimension reduction and variable selection , 2010, Journal of the Royal Statistical Society. Series B, Statistical methodology.

[5]  M. Barker,et al.  Partial least squares for discrimination , 2003 .

[6]  A. Zwinderman,et al.  Statistical Applications in Genetics and Molecular Biology Quantifying the Association between Gene Expressions and DNA-Markers by Penalized Canonical Correlation Analysis , 2011 .

[7]  Trevor Hastie,et al.  Regularization Paths for Generalized Linear Models via Coordinate Descent. , 2010, Journal of statistical software.

[8]  Danh V. Nguyen,et al.  Multi-class cancer classification via partial least squares with gene expression profiles , 2002, Bioinform..

[9]  A. Boulesteix PLS Dimension Reduction for Classification with Microarray Data , 2004, Statistical applications in genetics and molecular biology.

[10]  Celia M. T. Greenwood,et al.  A modified score function estimator for multinomial logistic regression in small samples , 2002 .

[11]  B. Marx Iteratively reweighted partial least squares estimation for generalized linear regression , 1996 .

[12]  Sujuan Gao,et al.  A Solution to Separation and Multicollinearity in Multiple Logistic Regression. , 2008, Journal of data science : JDS.

[13]  Philippe Besse,et al.  Statistical Applications in Genetics and Molecular Biology A Sparse PLS for Variable Selection when Integrating Omics Data , 2011 .

[14]  Danh V. Nguyen,et al.  Tumor classification by partial least squares using microarray gene expression data , 2002, Bioinform..

[15]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[16]  M. Daumer,et al.  Evaluating Microarray-based Classifiers: An Overview , 2008, Cancer informatics.

[17]  Marcel Dettling,et al.  BagBoosting for tumor classification with gene expression data , 2004, Bioinform..

[18]  D. Firth Bias reduction of maximum likelihood estimates , 1993 .

[19]  Peter Bühlmann,et al.  Supervised clustering of genes , 2002, Genome Biology.

[20]  E. Lander,et al.  Gene expression correlates of clinical prostate cancer behavior. , 2002, Cancer cell.

[21]  M. Schemper,et al.  A solution to the problem of separation in logistic regression , 2002, Statistics in medicine.

[22]  R. Tibshirani,et al.  Diagnosis of multiple cancer types by shrunken centroids of gene expression , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[23]  R. Gentleman,et al.  Classification Using Generalized Partial Least Squares , 2005 .