Sampling error profile analysis (SEPA) for model optimization and model evaluation in multivariate calibration

A novel method called sampling error profile analysis (SEPA), based on Monte Carlo sampling and error profile analysis, is proposed for outlier detection, cross validation, selection of pretreatment methods and wavelengths, and model evaluation in multivariate calibration. Monte Carlo sampling in SEPA generates a number of submodels, and the subsequent error profile analysis yields the median and the standard deviation of the root‐mean‐square error (RMSE) across those submodels. The median, coupled with the standard deviation, is an estimate of the RMSE that is more predictive and robust because it draws on representative submodels produced by Monte Carlo sampling, whereas the conventional approach relies on only 1 model. The error profile analysis also calculates the skewness and kurtosis of the RMSE distribution as an auxiliary check on the estimated RMSE, which is useful for model optimization and model evaluation. The proposed method is evaluated with 3 near‐infrared datasets for wheat, corn, and tobacco. The results show that SEPA can diagnose outliers using more parameters, select more reasonable pretreatment methods and wavelength points, and evaluate the model more accurately and precisely. Compared with the results reported in published papers, a better model could be obtained with SEPA in terms of RMSECV, RMSEC, and RMSEP estimated with an independent prediction set.
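
The sampling and profiling steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `sepa_profile`, the number of submodels, the calibration fraction, and the use of ordinary least squares as the regression model are all assumptions made here for brevity (the abstract does not specify them, and a calibration study would more likely use a latent-variable model). The kurtosis is reported as excess kurtosis, which is also an assumption.

```python
import numpy as np

def sepa_profile(X, y, n_submodels=200, calib_fraction=0.8, seed=0):
    """Monte Carlo error profile of the validation RMSE (illustrative sketch).

    All parameter names and defaults are hypothetical, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_cal = int(round(calib_fraction * n))
    rmse = np.empty(n_submodels)

    for i in range(n_submodels):
        # Monte Carlo sampling: random split into calibration and validation subsets
        perm = rng.permutation(n)
        cal, val = perm[:n_cal], perm[n_cal:]

        # Submodel: ordinary least squares with an intercept (stand-in model, an assumption)
        A_cal = np.column_stack([np.ones(cal.size), X[cal]])
        coef, *_ = np.linalg.lstsq(A_cal, y[cal], rcond=None)

        # RMSE of this submodel on its held-out samples
        A_val = np.column_stack([np.ones(val.size), X[val]])
        resid = y[val] - A_val @ coef
        rmse[i] = np.sqrt(np.mean(resid ** 2))

    # Error profile: location, spread, and shape of the RMSE distribution
    centered = rmse - rmse.mean()
    sd_pop = np.sqrt(np.mean(centered ** 2))
    return {
        "median": float(np.median(rmse)),
        "std": float(rmse.std(ddof=1)),
        "skewness": float(np.mean(centered ** 3) / sd_pop ** 3),
        "kurtosis": float(np.mean(centered ** 4) / sd_pop ** 4 - 3.0),  # excess kurtosis
        "rmse": rmse,
    }

if __name__ == "__main__":
    # Synthetic data for demonstration only; not one of the paper's datasets
    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)
    profile = sepa_profile(X, y)
    print({k: v for k, v in profile.items() if k != "rmse"})
```

In this sketch, a strongly skewed or heavy-tailed RMSE distribution would suggest that a few submodels behave very differently from the rest, which is the kind of auxiliary signal the abstract attributes to skewness and kurtosis.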
