New Developments in Sparse PLS Regression

Partial least squares (PLS) regression has recently gained much attention in the analysis of high-dimensional genomic datasets, and PLS-based methods for variable selection have been developed since the early 2000s. Most of these techniques rely on tuning parameters that are usually determined by cross-validation (CV), which raises serious stability issues. To overcome this, we have developed a new dynamic bootstrap-based method for selecting significant predictors, suitable for both PLS regression and its incorporation into generalized linear models (GPLS). It relies on bootstrap confidence intervals, which allow the significance of each predictor to be tested at a preset type I risk α, and it avoids CV. We have also developed adapted versions of sparse PLS (SPLS) and sparse GPLS (SGPLS) regression that use a recently introduced non-parametric bootstrap-based technique to determine the number of components. We compare the reliability of variable selection, the stability of tuning-parameter determination, and the predictive ability of these methods, using simulated data for PLS and real microarray gene expression data for PLS-logistic classification. We observe that our new dynamic bootstrap-based method separates random noise in y from the relevant information better than the other methods, leading to better accuracy and predictive ability, especially for non-negligible noise levels.
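The idea of selecting predictors through bootstrap confidence intervals can be illustrated with a minimal sketch. It is not the dynamic procedure described in the abstract: it assumes a plain percentile bootstrap over observations, a univariate response, and a number of components fixed in advance. The function names `pls1_coefficients` and `bootstrap_select`, the default values, and the simulated data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def pls1_coefficients(X, y, n_components):
    """PLS1 regression coefficients (NIPALS-style) on centered X and y."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    Xk = X.copy()
    n_pred = X.shape[1]
    W = np.zeros((n_pred, n_components))  # weight vectors
    P = np.zeros((n_pred, n_components))  # X loadings
    q = np.zeros(n_components)            # y loadings
    for k in range(n_components):
        w = Xk.T @ y
        w /= np.linalg.norm(w)
        t = Xk @ w                        # score vector
        tt = t @ t
        p_k = Xk.T @ t / tt
        q[k] = y @ t / tt
        Xk = Xk - np.outer(t, p_k)        # deflate X
        W[:, k], P[:, k] = w, p_k
    # Coefficients expressed in terms of the original (centered) predictors.
    return W @ np.linalg.solve(P.T @ W, q)


def bootstrap_select(X, y, n_components=2, n_boot=300, alpha=0.05, seed=0):
    """Flag predictors whose percentile bootstrap CI excludes zero.

    Assumption: pairs (x_i, y_i) are resampled with replacement and the
    number of components is kept fixed across bootstrap samples.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    boot = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        boot[b] = pls1_coefficients(X[idx], y[idx], n_components)
    lower = np.quantile(boot, alpha / 2, axis=0)
    upper = np.quantile(boot, 1 - alpha / 2, axis=0)
    selected = (lower > 0) | (upper < 0)
    return selected, lower, upper


# Simulated example: only the first three predictors carry signal.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 20))
beta_demo = np.zeros(20)
beta_demo[:3] = [3.0, -2.0, 2.0]
y_demo = X_demo @ beta_demo + rng.normal(scale=1.0, size=100)

selected, lower, upper = bootstrap_select(X_demo, y_demo)
print("Selected predictor indices:", np.flatnonzero(selected))
```

Each predictor is tested separately at level α here, so a few noise predictors may still be flagged by chance. No CV step is involved, which is the property the abstract emphasizes.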
