Boosting model performance and interpretation by entangling preprocessing selection and variable selection.

The aim of data preprocessing is to remove data artifacts (such as a baseline, scatter effects or noise) and to enhance the contextually relevant information. Many preprocessing methods exist to deliver one or more of these benefits, but it is difficult to determine which method or combination of methods should be used for the specific data being analyzed. Recently, we have shown that a preprocessing selection approach based on Design of Experiments (DoE) enables correct selection of highly appropriate preprocessing strategies within reasonable time frames. In that approach, the focus was solely on improving the predictive performance of the chemometric model. This is, however, only one of the two relevant criteria in modeling: interpretation of the model results can be just as important. Variable selection is often used to achieve such interpretation. Data artifacts, however, may hamper proper variable selection by masking the truly relevant variables. The choice of preprocessing therefore has a large impact on the outcome of variable selection methods and may thus hamper an objective interpretation of the final model. To enhance such objective interpretation, we here integrate variable selection into the DoE-based preprocessing selection approach. We show that the entanglement of preprocessing selection and variable selection improves not only the interpretation, but also the predictive performance of the model. This is achieved by analyzing several experimental data sets for which the true relevant variables are available as prior knowledge. We show that the resulting selection of variables agrees better with the true informative variables than when both model aspects are optimized individually. Importantly, the approach presented in this work is generic. Different types of models (e.g. PCR, PLS, …) can be incorporated into it, as well as different variable selection methods and different preprocessing methods, according to the taste and experience of the user. In this work, the approach is illustrated by using PLS as the model and PPRV-FCAM (Predictive Property Ranked Variable using Final Complexity Adapted Models) for variable selection.
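As a rough illustration of the DoE idea described above, the following minimal sketch builds a two-level full factorial design over preprocessing steps and estimates the main effect of each step on a model error. It is not the implementation used in this work: the factor names, the helper functions `full_factorial` and `main_effects`, and the placeholder response `toy_error` are all assumptions made for the example. In practice the response would be something like a cross-validated error of a model built after both preprocessing and variable selection.

```python
from itertools import product
from statistics import mean

# Hypothetical two-level factors: each preprocessing step is off (0) or on (1).
FACTORS = ("baseline", "scatter", "smoothing", "scaling")


def full_factorial(factors):
    """Return every on/off combination of the given factors as a list of dicts."""
    return [dict(zip(factors, levels))
            for levels in product((0, 1), repeat=len(factors))]


def main_effects(design, responses):
    """Main effect per factor: mean response with the step on minus mean with it off."""
    effects = {}
    for name in design[0]:
        on = [r for run, r in zip(design, responses) if run[name] == 1]
        off = [r for run, r in zip(design, responses) if run[name] == 0]
        effects[name] = mean(on) - mean(off)
    return effects


def toy_error(run):
    """Placeholder response (lower is better). It stands in for the
    cross-validated error of a model fitted after preprocessing and
    variable selection; the coefficients are invented for illustration."""
    return (1.0 - 0.3 * run["baseline"] - 0.1 * run["scatter"]
            + 0.05 * run["smoothing"])


design = full_factorial(FACTORS)
responses = [toy_error(run) for run in design]
effects = main_effects(design, responses)

# Keep only the steps whose main effect lowers the error.
selected = [name for name, effect in effects.items() if effect < 0]
print(effects)
print(selected)
```

With four factors the design has 16 runs, so the cost of screening grows quickly as steps are added; a fractional design would be one way to limit it, at the price of confounding some effects.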
