Variable selection using iterative reformulation of training set models for discrimination of samples: application to gas chromatography/mass spectrometry of mouse urinary metabolites.

The paper discusses variable selection as used in large metabolomic studies, exemplified by gas chromatography/mass spectrometry of urine from 441 mice in three experiments designed to detect the influence of age, diet, and stress on their chemosignal. Partial least squares discriminant analysis (PLS-DA) was applied to obtain class models, using a procedure of 20,000 iterations that includes the bootstrap for model optimization and random splits into test and training sets for validation. Variables are selected from the PLS regression coefficients computed on the training set, with the number of components optimized by the bootstrap. The variables are ranked in order of significance, and the overall optimal variables are those that appear as highly significant over 100 different test and training set splits. A cost/benefit analysis of building the model on a reduced number of variables is also illustrated. The paper provides a properly validated strategy for determining which variables are most significant for discriminating between two groups in large metabolomic data sets. It avoids the common pitfall of overfitting that arises when variables are selected on a combined training and test set, and it takes into account that different variables may be selected each time the samples are split into training and test sets in an iterative procedure.
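
The split-and-rank idea can be illustrated with a minimal sketch; it is not the paper's implementation. It makes several simplifying assumptions: a single PLS component is used instead of a bootstrap-optimized number, variables are mean-centred but not otherwise scaled, and the function names (`pls1_coefficients`, `stable_variables`), the parameters `top_k` and `min_frequency`, and the synthetic data are all hypothetical. The key point it shows is that coefficients are computed on the training samples only, and a variable is kept only if it ranks highly in most of the random splits.

```python
import random
from collections import Counter


def pls1_coefficients(X, y):
    """Regression coefficients of a one-component PLS1 model.

    Assumption: one component only; the paper optimizes this number
    with the bootstrap. X and y are mean-centred before fitting.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximum covariance between X and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0.0:  # e.g. a training set containing a single class
        return [0.0] * p
    w = [v / norm for v in w]
    # Scores t = Xc w, then regress y on t.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    return [wj * q for wj in w]


def stable_variables(X, y, n_splits=100, train_fraction=2 / 3,
                     top_k=3, min_frequency=0.8, seed=0):
    """Indices of variables ranked in the top_k in most training splits.

    top_k and min_frequency are illustrative thresholds, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    n = len(X)
    n_train = int(round(train_fraction * n))
    counts = Counter()
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        train = idx[:n_train]  # the remaining samples are never used here
        b = pls1_coefficients([X[i] for i in train], [y[i] for i in train])
        ranked = sorted(range(len(b)), key=lambda j: abs(b[j]), reverse=True)
        counts.update(ranked[:top_k])
    return sorted(j for j, c in counts.items() if c / n_splits >= min_frequency)


# Synthetic two-class data: 60 samples, 10 variables, and only
# variables 0 and 1 differ between the classes.
rng = random.Random(1)
y = [i % 2 for i in range(60)]
X = [[rng.gauss(0, 1) + (3.0 if (j < 2 and y[i]) else 0.0) for j in range(10)]
     for i in range(60)]
selected = stable_variables(X, y, top_k=2)
print(selected)
```

Because the held-out samples never influence the ranking, they remain available for an unbiased estimate of classification performance on the reduced variable set. Ranking by absolute coefficient assumes the variables are on comparable scales; real chromatographic peak tables would likely need scaling first.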