Comparing Standard Deviations

In one of the sessions at Chambersburg a speaker (and I've forgotten who it was) put up a slide in which the performance of various calibration methods was compared by calculating the standard deviation of prediction errors on a validation set for each method. The question which then arose in discussion was: how different do two such standard deviations have to be before we can be reasonably sure that the apparently better method really is better, and that the results would not be reversed if we took another validation set? This column is an attempt to provide an answer to this question, which turns out to be more difficult than it might at first appear.

First, let me restate the problem more precisely. Suppose two different prediction methods have been calibrated to predict y from x. In the NIR context, y is some lab measurement and x, usually multivariate, is spectral data. The prediction methods might be very similar, e.g. two multiple regression equations, or very different, e.g. a neural network with 10 principal components as input and a simple regression equation based on the ratio of two derivative terms. The calibration may have been done on the same or on different calibration data. All that matters is that, for a given x, each method will produce a prediction of y. Then suppose the two methods are compared by taking a single validation set of n samples with known x and y and predicting y from x using each method. Since the true y is known, this gives a set of n prediction errors e for each method. The validation set should not have been used in either calibration procedure, a point I shall return to at more length in a future column.

One way of summarising these results is to calculate the mean, m, and the standard deviation, s, of the n errors for each method, the standard deviation being the square root of the sum of squared differences from the mean divided by n − 1. Another (and arguably more relevant) summary is the root mean square error, i.e. the square root of the sum of squared errors divided by n, which combines the bias (or mean error) and the standard deviation in a single measure. However, the statistical arguments turn out to be easier if you separate the two, so that is the case I will deal with. The formulae, for those who prefer algebra to words, are as follows.
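Writing $e_1, \dots, e_n$ for the n prediction errors produced by one of the methods, the verbal definitions above amount to (nothing is assumed here beyond the notation $e_i$):

$$
m = \frac{1}{n}\sum_{i=1}^{n} e_i, \qquad
s = \sqrt{\frac{\sum_{i=1}^{n} (e_i - m)^2}{n - 1}}, \qquad
\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n}},
$$

so that $\mathrm{RMSE}^2 = m^2 + \frac{n-1}{n}\,s^2$, which is the sense in which the root mean square error combines the bias and the standard deviation in a single measure.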