Supervised Learning for Multi-Block Incomplete Data

In supervised high-dimensional settings, where the number of variables is large and the number of individuals is small, one objective is to select the relevant variables and thereby reduce the dimension. This subspace selection is often handled with supervised tools. However, some data can be missing, which compromises the validity of the subspace selection. We propose a Partial Least Squares (PLS) based method, called Multi-block Data-Driven sparse PLS (mdd-sPLS), that performs variable selection and subspace estimation jointly while imputing missing data in both the training and test sets through a new algorithm called Koh-Lanta. The method was compared in simulations against existing methods such as mean imputation, NIPALS, softImpute and imputeMFA. In the context of supervised analysis of high-dimensional data, the proposed method shows the lowest prediction error of the response variables. So far, this is the only method combining data imputation and response variable prediction. The superiority of the supervised multi-block mdd-sPLS method increases with the intra-block and inter-block correlations. The application to a real data set from an rVSV-ZEBOV Ebola vaccine trial revealed interesting and biologically relevant results. The method is implemented in an R package available on CRAN and a Python package available on PyPI.
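To make the simplest baseline named above concrete, the following is a minimal sketch of mean imputation followed by a one-component sparse PLS fit for a single block and a single response. It is not the Koh-Lanta algorithm and not the API of the mdd-sPLS packages: the function names (`mean_impute`, `soft_threshold`, `fit_sparse_pls`, `predict`), the use of soft-thresholding on empirical correlations as the sparsity mechanism, and the parameter `lam` are all assumptions made for illustration.

```python
import numpy as np


def mean_impute(X):
    """Replace each NaN by the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X, col_means


def soft_threshold(v, lam):
    """Shrink each entry of v toward zero by lam; entries below lam become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def fit_sparse_pls(X, y, lam):
    """Illustrative one-component sparse PLS on mean-imputed data.

    lam is a hypothetical threshold on the absolute empirical correlation
    between each column of X and y. Assumes y is not constant.
    """
    X_imp, col_means = mean_impute(X)
    y = np.asarray(y, dtype=float)
    x_std = X_imp.std(axis=0)
    x_std[x_std == 0] = 1.0  # avoid division by zero for constant columns
    Xs = (X_imp - col_means) / x_std
    y_mean, y_std = y.mean(), y.std()
    corr = Xs.T @ ((y - y_mean) / y_std) / len(y)
    w = soft_threshold(corr, lam)  # variables with |corr| <= lam are dropped
    norm = np.linalg.norm(w)
    if norm == 0:
        raise ValueError("lam is too large: no variable was selected")
    w = w / norm
    t = Xs @ w  # scores of the individuals on the single component
    slope = t @ (y - y_mean) / (t @ t)  # least-squares regression of y on t
    return {
        "col_means": col_means,
        "x_std": x_std,
        "weights": w,
        "selected": np.flatnonzero(w),
        "slope": slope,
        "y_mean": y_mean,
    }


def predict(model, X_new):
    """Predict y for new rows, imputing their NaNs with training column means."""
    X_new = np.asarray(X_new, dtype=float).copy()
    rows, cols = np.where(np.isnan(X_new))
    X_new[rows, cols] = model["col_means"][cols]
    Xs = (X_new - model["col_means"]) / model["x_std"]
    return model["y_mean"] + model["slope"] * (Xs @ model["weights"])
```

Calling `fit_sparse_pls(X, y, lam=0.6)` on a matrix `X` containing `NaN` entries returns the indices of the retained variables in `model["selected"]`, and `predict(model, X_test)` handles missing test values with the training means. In this sketch, imputation is done once and independently of the response, which is the limitation the abstract's joint approach is meant to address.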
