Weight randomization test for the selection of the number of components in PLS models

The selection of the optimal number of components remains a difficult but essential task in partial least squares (PLS) modeling. Randomization tests have the advantage of being automatic and of using the entire dataset, in contrast to the widely used cross-validation approaches. A PLS model may include one or more components that carry a large amount of irrelevant data variation, and this may affect the model, depending on the assigned y-loading (the regression coefficient in the latent domain). We recently showed this within the basic sequence framework, with respect to the underlying theory of the PLS algorithm, and presented it to the chemometrics community. In this work, we show that this irrelevant data variation is the root cause of the difficulty that current methods have in selecting the optimal number of components. For randomization tests, PLS models with nonsignificant components may yield false positives because of the incorrect assumption that "the components enter the model in a natural order".
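
As a rough illustration of how a generic randomization test for PLS components can work (this is not the weight randomization test proposed in this work), the sketch below makes several assumptions: a single response, NIPALS-style PLS1 deflation, and a test statistic equal to the norm of X^T y, which is proportional to the covariance between the component score and the response. The function name `component_pvalues` and all of its parameters are hypothetical.

```python
import numpy as np


def component_pvalues(X, y, max_components=3, n_perm=199, seed=0):
    """Permutation p-value for each successive PLS1 component (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Center the data, as is customary before PLS.
    X = X - X.mean(axis=0)
    y = y - y.mean()
    pvalues = []
    for _ in range(max_components):
        # Observed statistic: ||X^T y||, proportional to cov(score, y).
        observed = np.linalg.norm(X.T @ y)
        # Null distribution: break the X-y link by permuting y.
        null = np.array(
            [np.linalg.norm(X.T @ rng.permutation(y)) for _ in range(n_perm)]
        )
        # Add-one correction so the p-value is never exactly zero.
        pvalues.append((1 + np.sum(null >= observed)) / (n_perm + 1))
        # Extract the component and deflate X and y before testing the next one.
        w = X.T @ y
        w /= np.linalg.norm(w)
        t = X @ w
        tt = t @ t
        X = X - np.outer(t, X.T @ t / tt)
        y = y - t * (t @ y / tt)
    return pvalues


# Usage on simulated data with one relevant latent variable (assumed setup).
rng = np.random.default_rng(1)
t_true = rng.normal(size=40)
X_sim = np.outer(t_true, rng.normal(size=10)) + 0.1 * rng.normal(size=(40, 10))
y_sim = 2.0 * t_true + 0.1 * rng.normal(size=40)
print(component_pvalues(X_sim, y_sim))
```

A sequential scheme of this kind tests each component only after the preceding ones have been deflated, so it implicitly relies on the natural-order assumption that the text above identifies as a source of false positives.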
