Cross-validatory selection of test and validation sets in multivariate calibration and neural networks as applied to spectroscopy.

Cross-validated and non-cross-validated regression models using principal component regression (PCR), partial least squares (PLS) and artificial neural networks (ANN) have been used to relate the concentrations of polycyclic aromatic hydrocarbon (PAH) pollutants to the electronic absorption spectra of coal tar pitch volatiles. The different trends in the cross-validated and non-cross-validated results are discussed, as is a method for producing a truly cross-validated neural network regression model. It is shown that the methods must be compared through the errors produced in the validation sets as well as those given for the final model. Various methods for calculating errors are described and compared. The separation of training, validation and test sets into fully independent groups is emphasized. PLS outperforms PCR on all indicators. ANNs are inferior to the multivariate calibration techniques (PCR and PLS) for individual compounds but are reasonably effective in predicting the sum of PAHs in the mixture set.
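
The requirement that training, validation and test sets be fully independent can be illustrated with a minimal sketch. The abstract does not specify the exact partitioning scheme, so the nested layout below is an assumption: an outer loop holds out a test fold, and an inner loop splits the remaining samples into training and validation folds. The function names `nested_splits` and `rmse`, the fold counts and the sample count of 20 are all hypothetical, and root mean square error stands in for just one of the several error measures the paper compares.

```python
import math
import random


def nested_splits(n_samples, n_outer=5, n_inner=4, seed=0):
    """Yield (train, validation, test) index lists that share no sample.

    Hypothetical helper: the fold counts are illustrative, not the paper's.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Outer folds: each one serves once as the held-out test set.
    outer_folds = [indices[i::n_outer] for i in range(n_outer)]
    for i, test in enumerate(outer_folds):
        # Samples outside the test fold are available for model building.
        remainder = [idx for j, fold in enumerate(outer_folds)
                     if j != i for idx in fold]
        # Inner folds: each one serves once as the validation set.
        inner_folds = [remainder[k::n_inner] for k in range(n_inner)]
        for k, validation in enumerate(inner_folds):
            train = [idx for m, fold in enumerate(inner_folds)
                     if m != k for idx in fold]
            yield train, validation, test


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Example: 20 samples give 5 outer x 4 inner = 20 partitions.
splits = list(nested_splits(20))
```

In a scheme of this kind, the validation error would typically guide choices such as the number of PCR or PLS components or when to stop training an ANN, while the test error is reserved for the final assessment. Reporting both errors separately for each partition is what allows the methods to be compared on validation performance as well as on the final model.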