Distribution based truncation for variable selection in subspace methods for multivariate regression

Abstract

Analysis of data containing a vast number of features, of which only a few are informative, requires methods that can separate true signal from noise variables. One class of methods that attempts this is sparse partial least squares regression (sparse PLS). This paper aims to improve the theoretical foundation, speed and robustness of such methods. A general justification for truncating PLS loading weights is obtained through distribution theory and the central limit theorem. We also introduce a fast plug-in truncation procedure that draws on a novel application of theory originally intended for analysis of variance in experiments without replicates. The result is a versatile and intuitive method that performs component-wise variable selection very efficiently and in a less ad hoc manner than existing methods. Prediction performance is on par with existing methods, while the stronger theoretical foundation supports robustness.
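To make the idea of truncating loading weights concrete, the following is a minimal sketch, not the authors' procedure. It assumes a single response, computes the first PLS loading weight vector as the normalised covariance between the centred predictors and the response, and estimates the noise level with Lenth's pseudo standard error, a standard tool for unreplicated factorial experiments. The function names, the `multiplier` parameter and the rule of zeroing weights below `multiplier` times the pseudo standard error are illustrative assumptions.

```python
import numpy as np


def lenth_pse(effects):
    """Lenth's pseudo standard error for a vector of effect estimates."""
    abs_eff = np.abs(np.asarray(effects, dtype=float))
    s0 = 1.5 * np.median(abs_eff)
    # Drop values that look like real effects before re-estimating the scale.
    trimmed = abs_eff[abs_eff < 2.5 * s0]
    if trimmed.size == 0:
        return 0.0
    return 1.5 * np.median(trimmed)


def truncated_loading_weights(X, y, multiplier=2.0):
    """First PLS loading weight vector with small entries set to zero.

    `multiplier` is a hypothetical tuning constant for this sketch, not a
    value taken from the paper.
    """
    Xc = X - X.mean(axis=0)          # centre each predictor column
    yc = y - y.mean()                # centre the response
    w = Xc.T @ yc                    # proportional to cov(x_j, y)
    norm = np.linalg.norm(w)
    if norm == 0.0:
        return w                     # no covariance with the response
    w = w / norm

    threshold = multiplier * lenth_pse(w)
    w_trunc = np.where(np.abs(w) >= threshold, w, 0.0)

    # Rescale the surviving weights to unit length, if any survive.
    norm_trunc = np.linalg.norm(w_trunc)
    return w_trunc / norm_trunc if norm_trunc > 0.0 else w_trunc


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 200))
    y = 5.0 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
    w = truncated_loading_weights(X, y)
    print("variables kept:", np.flatnonzero(w))
```

Because the threshold is computed directly from the weights themselves, no cross-validation over a sparsity parameter is needed in this sketch, which is the sense in which a plug-in rule can be fast.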
