Multiple Versus Single Set Validation of Multivariate Models to Avoid Mistakes

ABSTRACT Validation of multivariate models is of current importance for a wide range of chemical applications. Although important, it is neglected. The common practice is to use a single external validation set for evaluation. This approach is deficient and may mislead investigators with results that are specific to that single validation set. In addition, it provides no statistics regarding the precision of a derived figure of merit (FOM). A statistical approach using bootstrapped Latin partitions is advocated. This validation method makes efficient use of the data because each object is used once for validation. It was reviewed a decade earlier, but primarily for the optimization of chemometric models; this review presents the reasons it should be used for generalized statistical validation. Average FOMs with confidence intervals are reported, and powerful matched-sample statistics may be applied for comparing models and methods. Examples demonstrate the problems with single validation sets.
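
The validation scheme summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a Latin partition is a random split into folds that keeps class proportions roughly equal, that the FOM is classification accuracy, and that a normal approximation is acceptable for the 95% confidence interval. The names `latin_partitions`, `nearest_mean_predict`, and `blp_accuracy` are hypothetical, and the nearest-class-mean classifier is only a stand-in for whatever model is being validated.

```python
import random
import statistics
from collections import defaultdict


def latin_partitions(labels, n_partitions, rng):
    """Randomly split object indices into folds with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_partitions)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the shuffled members of each class across the folds in turn.
        for index in members:
            folds[position % n_partitions].append(index)
            position += 1
    return folds


def nearest_mean_predict(train_x, train_y, test_x):
    """Stand-in classifier (assumption): assign each object to the nearest class mean."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(train_x, train_y):
        previous = sums.get(label, [0.0] * len(row))
        sums[label] = [s + v for s, v in zip(previous, row)]
        counts[label] += 1
    means = {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(means, key=lambda lab: squared_distance(row, means[lab]))
            for row in test_x]


def blp_accuracy(x, y, n_bootstraps=10, n_partitions=4, seed=0):
    """Return (mean accuracy, 95% CI half-width) over the bootstraps."""
    if n_bootstraps < 2:
        raise ValueError("at least two bootstraps are needed for an interval")
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_bootstraps):
        predictions = [None] * len(y)
        # Within one bootstrap, every object is held out exactly once.
        for fold in latin_partitions(y, n_partitions, rng):
            held_out = set(fold)
            train = [i for i in range(len(y)) if i not in held_out]
            predicted = nearest_mean_predict(
                [x[i] for i in train], [y[i] for i in train], [x[i] for i in fold])
            for i, label in zip(fold, predicted):
                predictions[i] = label
        # Pool the held-out predictions into one accuracy per bootstrap.
        accuracies.append(sum(p == t for p, t in zip(predictions, y)) / len(y))
    # Normal approximation (assumption); a t critical value would be wider.
    z = statistics.NormalDist().inv_cdf(0.975)
    half_width = z * statistics.stdev(accuracies) / len(accuracies) ** 0.5
    return statistics.mean(accuracies), half_width
```

Because the same seed reproduces the same partitions, two models evaluated with identical `seed`, `n_bootstraps`, and `n_partitions` are scored on matched samples, which is what would allow a paired comparison of their per-bootstrap FOMs.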
