Cross model validation and optimisation of bilinear regression models

Abstract

Whenever regression models are optimised, every optimisation step must be properly validated. Variable selection is one example of parameter estimation that yields overly optimistic models if it is left out of the validation. Many published studies perform the validation after variable selection, and several authors have correctly noted that such models are optimistically biased. However, when few samples are available, splitting the data into a training set and a validation set may degrade both the calibration model and the validation. Cross model validation is designed to validate the optimisation by including the variable selection in an extra layer of cross-validation, so that all available samples are used both for training and for estimating the residual error of the model. Cross model validation poses challenging questions both conceptually and algorithmically, and a presentation of the full workflow is needed. We present a complete framework covering optimisation, validation and calibration of bilinear regression models with variable selection. We address several issues that are important at each stage of the analysis and propose improvements. The method is validated on a gene expression data set with a low signal-to-noise ratio and a small number of samples. We show that many replicates are needed to model these data properly, and that cross model validated variable selection improves both the final calibration model and the associated error estimates. A MATLAB toolbox (MathWorks Inc., USA) is available from www.specmod.org.
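The idea of placing variable selection inside an extra layer of cross-validation can be illustrated with a minimal Python sketch. It is not the authors' MATLAB toolbox: every function name here (`pls1_fit`, `select_top`, `cross_model_validation`, `naive_rmsecv`) is hypothetical, the bilinear model is a plain single-response PLS regression, and a simple covariance ranking stands in for the selection rule, which the abstract does not specify. The inner layer picks the number of variables using only the outer training samples, so each left-out sample is predicted by a model that never saw it during selection.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """Single-response PLS (NIPALS-style). Returns (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xr.T @ yr
        norm = np.linalg.norm(w)
        if norm < 1e-12:              # no covariance left to model
            break
        w = w / norm
        t = Xr @ w                    # scores
        tt = t @ t
        p = Xr.T @ t / tt             # X loadings
        c = yr @ t / tt               # y loading
        Xr = Xr - np.outer(t, p)      # deflate X and y
        yr = yr - c * t
        W.append(w); P.append(p); q.append(c)
    if not W:
        return np.zeros(X.shape[1]), x_mean, y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q), x_mean, y_mean

def select_top(X, y, k):
    """Assumed stand-in selection rule: k variables with largest |cov(x, y)|."""
    score = np.abs((X - X.mean(axis=0)).T @ (y - y.mean()))
    return np.argsort(score)[::-1][:k]

def make_folds(n, n_folds, rng):
    return np.array_split(rng.permutation(n), n_folds)

def fit_predict(Xtr, ytr, Xte, keep, n_comp):
    coef, xm, ym = pls1_fit(Xtr[:, keep], ytr, n_comp)
    return (Xte[:, keep] - xm) @ coef + ym

def cv_press(X, y, folds, k, n_comp):
    """PRESS with the selection repeated inside every fold."""
    press = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        keep = select_top(X[train], y[train], k)
        pred = fit_predict(X[train], y[train], X[test], keep, n_comp)
        press += np.sum((y[test] - pred) ** 2)
    return press

def cross_model_validation(X, y, k_grid, n_comp=2, n_outer=5, n_inner=4, seed=0):
    """Outer layer estimates error; inner layer optimises the selection."""
    rng = np.random.default_rng(seed)
    preds = np.empty(len(y))
    for test in make_folds(len(y), n_outer, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        Xtr, ytr = X[train], y[train]
        inner = make_folds(len(train), n_inner, rng)
        best_k = min(k_grid, key=lambda k: cv_press(Xtr, ytr, inner, k, n_comp))
        keep = select_top(Xtr, ytr, best_k)   # uses outer-training samples only
        preds[test] = fit_predict(Xtr, ytr, X[test], keep, n_comp)
    return float(np.sqrt(np.mean((y - preds) ** 2)))

def naive_rmsecv(X, y, k, n_comp=2, n_folds=5, seed=0):
    """Selection on ALL samples, then ordinary CV: optimistically biased."""
    rng = np.random.default_rng(seed)
    keep = select_top(X, y, k)
    preds = np.empty(len(y))
    for test in make_folds(len(y), n_folds, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        preds[test] = fit_predict(X[train], y[train], X[test], keep, n_comp)
    return float(np.sqrt(np.mean((y - preds) ** 2)))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((30, 500))   # pure noise: no real X-y relation
    y = rng.standard_normal(30)
    print("naive RMSECV:", naive_rmsecv(X, y, k=10))
    print("cross model validated RMSE:", cross_model_validation(X, y, [5, 10, 20]))
```

On data with no real relationship between X and y, the naive estimate would be expected to look much better than the cross model validated one, which is the optimistic bias the abstract describes. Once the error has been estimated this way, a final calibration model would typically be refitted on all samples.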
