Monte-Carlo methods for determining optimal number of significant variables. Application to mouse urinary profiles

Three methods for variable selection are described, namely the t-statistic, Partial Least Squares Discriminant Analysis (PLS-DA) weights and PLS-DA regression coefficients, with the aim of determining which variables are the most significant markers for discriminating between two groups: a variable’s level of significance is related to the magnitude of its statistic. Monte-Carlo methods are employed to determine the empirical significance of variables, by randomly permuting the class membership 5000 times to obtain null distributions and comparing the observed statistic for each variable with its null distribution. Seven simulated datasets, each consisting of 200 samples divided equally between two classes and 300 variables, are constructed: in one dataset there are no induced correlations between variables; in two datasets correlations are induced but there is no induced separation between the classes; and in four datasets separation is induced by selecting 20 of the variables to be discriminators. In addition, two metabolomic datasets were analysed, consisting of the GCMS of urinary extracts from mice, to determine the effect of stress and the effect of diet on the urinary chemosignal. It is shown that the t-statistic combined with Monte-Carlo permutations provides similar results to PLS weights. PLS regression coefficients find the fewest markers but, for the simulations, give the lowest false positive rates.
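The permutation procedure summarised above can be illustrated with a minimal Python sketch for the t-statistic case only. It is not the authors' implementation. The function names (`simulate`, `t_statistics`, `permutation_p_values`), the pooled-variance form of the t-statistic, the two-sided comparison on absolute values, the `(count + 1) / (n_perm + 1)` estimate of empirical significance, and the size of the induced shift are all assumptions made for illustration. The simulated data loosely mimic the abstract's setup of 200 samples, 300 variables and 20 discriminators, without induced correlations.

```python
import numpy as np


def simulate(n_samples=200, n_vars=300, n_disc=20, shift=1.0, seed=0):
    """Hypothetical simulation: two equal classes, first n_disc variables shifted."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_samples // 2)
    X = rng.standard_normal((n_samples, n_vars))
    X[y == 1, :n_disc] += shift  # induce class separation in the discriminators
    return X, y


def t_statistics(X, y):
    """Pooled-variance two-sample t-statistic for every column of X."""
    a, b = X[y == 0], X[y == 1]
    na, nb = len(a), len(b)
    pooled = ((na - 1) * a.var(axis=0, ddof=1)
              + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled * (1 / na + 1 / nb))


def permutation_p_values(X, y, n_perm=5000, seed=0):
    """Empirical two-sided p-value per variable from permuted class labels."""
    rng = np.random.default_rng(seed)
    observed = np.abs(t_statistics(X, y))
    exceed = np.zeros(X.shape[1])
    for _ in range(n_perm):
        # Shuffle class membership to draw one sample from the null distribution.
        permuted = np.abs(t_statistics(X, rng.permutation(y)))
        exceed += permuted >= observed
    # Add-one estimate (an assumption here) so no p-value is exactly zero.
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    X, y = simulate()
    p = permutation_p_values(X, y, n_perm=5000)
    print("variables with p < 0.05:", int((p < 0.05).sum()))
```

Variables whose empirical p-value falls below a chosen threshold would be retained as candidate markers. The same permutation loop could in principle wrap PLS-DA weights or regression coefficients in place of `t_statistics`, although that is not shown here.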
