Assessing the performance of statistical validation tools for megavariate metabolomics data

Statistical model validation tools such as cross-validation, jack-knifing of model parameters and permutation tests are meant to give an objective assessment of the performance and stability of a statistical model. However, little is known about how well these tools perform on megavariate data sets, in which, for instance, the number of variables is more than 10 times the number of subjects. Here their performance is assessed for megavariate metabolomics data, but the conclusions also carry over to proteomics, transcriptomics and many other research areas. Partial least squares discriminant analysis (PLS-DA) models were built for several LC-MS lipidomic training data sets containing various numbers of lean and obese subjects. The training data sets were compared on their modelling performance and their predictive ability using 10-fold cross-validation, a permutation test, and test data sets. A wide range of cross-validation error rates was found (from 7.5% to 16.3% for the largest training set and from 0% to 60% for the smallest training set), and the error rate increased as the number of subjects decreased. The test error rates varied from 5% to 50%. The smaller the number of subjects relative to the number of variables, the less the outcome of validation tools such as cross-validation, jack-knifing of model parameters and permutation tests can be trusted. The result depends crucially on the specific sample of subjects used for modelling. The validation tools cannot be used as a warning mechanism for problems caused by sample size or by unrepresentative sampling.
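The validation scheme described above (k-fold cross-validation combined with a label permutation test) can be illustrated with a minimal sketch. It rests on several assumptions that are not part of the study: the data are synthetic Gaussian noise with a mean shift on a few variables rather than LC-MS lipidomic profiles, a nearest-centroid classifier stands in for PLS-DA to keep the example dependency-free, and the names `make_data`, `cv_error` and `permutation_p_value` are hypothetical.

```python
import random


def make_data(n_per_class, n_vars, n_informative, shift, rng):
    """Synthetic two-class data (an assumption, not the study's data).

    Every variable is standard Gaussian noise; for class 1 the first
    n_informative variables receive a mean shift.
    """
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_vars)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


def centroid(rows):
    """Column-wise mean of a list of equally long rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict(x, c0, c1):
    """Nearest-centroid rule (stand-in for PLS-DA in this sketch)."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return 0 if d0 <= d1 else 1


def cv_error(X, y, k, rng):
    """k-fold cross-validation error rate; the model is refit per fold."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        c0 = centroid([X[i] for i in train if y[i] == 0])
        c1 = centroid([X[i] for i in train if y[i] == 1])
        errors += sum(predict(X[i], c0, c1) != y[i] for i in fold)
    return errors / len(y)


def permutation_p_value(X, y, k, n_perm, rng):
    """Share of label permutations whose CV error is at most the observed one."""
    observed = cv_error(X, y, k, rng)
    count = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # breaks any link between data and class labels
        if cv_error(X, y_perm, k, rng) <= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(0)
    # Draw several training sets per sample size and report the spread
    # of the 10-fold cross-validation error rate.
    for n_per_class in (5, 10, 20):
        errors = []
        for _ in range(10):
            X, y = make_data(n_per_class, 300, 10, 1.0, rng)
            errors.append(cv_error(X, y, 10, rng))
        print(n_per_class, min(errors), max(errors))
```

Running the script prints, for each sample size, the lowest and highest cross-validation error over ten independently drawn training sets. Comparing that spread across sample sizes is one way to see how strongly the estimate depends on the particular subjects sampled; the figures come from the synthetic setup and are not comparable to the error rates reported above.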
