Evaluation of the effect of chance correlations on variable selection using Partial Least Squares-Discriminant Analysis.

Variable subset selection is often mandatory in high-throughput metabolomics and proteomics. However, depending on the variable-to-sample ratio, variable selection can be highly susceptible to chance correlations. Cross-validated estimates of the predictive capability of Partial Least Squares-Discriminant Analysis (PLSDA) models obtained after feature selection are overly optimistic if the selection is performed on the entire data set and no external validation set is available. In this work, a simulation of the statistical null hypothesis is proposed to test whether the discrimination capability of a PLSDA model after variable selection, as estimated by cross-validation, is significantly higher than that attributable to chance correlations in the original data set. The statistical significance of the PLSDA cross-validation (CV) figures of merit obtained after variable selection is expressed as p-values calculated with a permutation test that includes the variable selection step. The reliability of the approach is evaluated using two variable selection methods on experimental and simulated data sets with and without induced class differences. The proposed approach is a useful tool when no external validation set is available, and it provides a straightforward way to compare variable selection methods.
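The permutation test described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The following are assumptions made for illustration only: the univariate filter in `select_variables`, the one-component PLS classifier with a 0.5 decision threshold, classification accuracy as the CV figure of merit, and the `(1 + b) / (1 + n)` p-value estimator. All function names are hypothetical. The point it shows is that the class labels are permuted before variable selection, so each null value carries the same selection-induced optimism as the observed one.

```python
import numpy as np

def select_variables(X, y, k):
    # Hypothetical filter: rank variables by the absolute class-mean
    # difference scaled by the overall standard deviation; keep the top k.
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    score = np.abs(diff) / (X.std(axis=0) + 1e-12)
    return np.argsort(score)[::-1][:k]

def pls1_fit_predict(X_train, y_train, X_test):
    # One-component PLS regression on a 0/1 class vector
    # (a deliberately simplified stand-in for a full PLSDA model).
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc, yc = X_train - x_mean, y_train - y_mean
    w = Xc.T @ yc
    w = w / (np.linalg.norm(w) + 1e-12)      # weight vector
    t = Xc @ w                               # training scores
    q = (t @ yc) / (t @ t + 1e-12)           # regression of y on the scores
    return y_mean + q * ((X_test - x_mean) @ w)

def cv_accuracy(X, y, k, n_folds, rng):
    # Selection uses ALL samples before cross-validation: the optimistic
    # setting the abstract describes, applied identically to real and
    # permuted labels.
    Xs = X[:, select_variables(X, y, k)]
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    correct = 0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        pred = pls1_fit_predict(Xs[train], y[train], Xs[test_idx])
        correct += np.sum((pred > 0.5) == (y[test_idx] == 1))
    return correct / len(y)

def permutation_p_value(X, y, k=10, n_folds=5, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = cv_accuracy(X, y, k, n_folds, rng)
    # Null distribution: permute labels, then repeat selection AND CV.
    null = np.array([cv_accuracy(X, rng.permutation(y), k, n_folds, rng)
                     for _ in range(n_perm)])
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null, p

# Usage on pure-noise data (no real class difference), 40 samples x 500 variables.
data_rng = np.random.default_rng(1)
X_noise = data_rng.normal(size=(40, 500))
y_noise = np.repeat([0.0, 1.0], 20)
observed, null, p = permutation_p_value(X_noise, y_noise, n_perm=50)
print(f"observed CV accuracy = {observed:.2f}, "
      f"mean null accuracy = {null.mean():.2f}, p = {p:.3f}")
```

On noise data like this, the null accuracies typically sit well above 50%, because selection on the full set exploits chance correlations. That is why the observed figure of merit is compared against this null distribution rather than against nominal chance level.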
