Ordered homogeneity pursuit lasso for group variable selection with applications to spectroscopic data

Abstract: In high-dimensional data modeling, variable selection methods are widely used to improve prediction accuracy by selecting a subset of informative variables; the resulting sparse representation also makes the model easier to interpret. In this study, we propose a novel group variable selection method, ordered homogeneity pursuit lasso (OHPL), that takes the homogeneity structure of high-dimensional data into account. OHPL is particularly useful for high-dimensional datasets with strongly correlated variables. We illustrate the approach on three real-world spectroscopic datasets and compare it with four state-of-the-art variable selection methods. The benchmark results show that the proposed method identifies a small number of influential groups and achieves better prediction performance than its competitors. The OHPL method is implemented in the R package OHPL, which also includes the spectroscopic datasets and is available from https://ohpl.io.
