Application of SIC (simple interval calculation) for object status classification and outlier detection—comparison with regression approach

We introduce a novel approach, termed simple interval calculation (SIC), for classifying object status in linear multivariate calibration (MVC) and other data-analytical contexts. SIC directly constructs an interval estimator for the predicted response and rests on a single assumption: all errors involved in MVC are bounded. We present the theory of the SIC method and explain its realization by linear programming techniques. The primary consequence of SIC is a radically new object classification that can be interpreted using a two-dimensional object status plot (OSP), ‘SIC residual vs. SIC leverage’. These two new measures of prediction quality are introduced in the traditional chemometric MVC context. Simple straight demarcations divide the OSP into areas that quantitatively discriminate all objects involved in modeling and prediction into four types: boundary samples, the significant objects (generating the entire data structure) within the training subset; insiders, samples that comply with the model; outsiders, samples with large prediction errors; and outliers, samples that cannot be predicted at all with respect to a given model. We also present detailed comparisons of the new SIC approach with traditional chemometric methods for MVC, classification and outlier detection. The comparisons employ four real-world data sets, selected for their particular complexities, which serve as showcases of SIC application to intricate training and test set data structures. Copyright © 2005 John Wiley & Sons, Ltd.
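The bounded-error assumption and its linear programming realization can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sic_interval`, the use of `scipy.optimize.linprog`, and a single uniform error bound `beta` are choices made for this sketch. Under the bounded-error assumption, the feasible coefficients are all `b` with `|y_i - x_i·b| <= beta` on the training data, and the SIC prediction interval for a new object is the range of `x_new·b` over that region, found by two linear programs.

```python
import numpy as np
from scipy.optimize import linprog

def sic_interval(X, y, x_new, beta):
    """SIC-style prediction interval under a uniform error bound beta.

    Feasible region: all coefficient vectors b with |y_i - X[i] @ b| <= beta.
    The interval endpoints are the min and max of x_new @ b over that region,
    each obtained by one linear program.
    """
    n, p = X.shape
    # |y - X b| <= beta rewritten as two one-sided inequality systems:
    #   X b <= y + beta   and   -X b <= -(y - beta)
    A_ub = np.vstack([X, -X])
    b_ub = np.concatenate([y + beta, -(y - beta)])
    bounds = [(None, None)] * p  # coefficients unrestricted in sign

    lo = linprog(c=x_new, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    hi = linprog(c=-x_new, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not (lo.success and hi.success):
        raise ValueError("empty feasible region: beta too small for this data")
    return lo.fun, -hi.fun  # minimizing -x_new @ b gives the maximum

# Worked example: responses generated exactly by b = (2, 3), error bound 0.1.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, 3.0])
lo, hi = sic_interval(X, y, np.array([1.0, 1.0]), 0.1)
# The interval [lo, hi] brackets the true response 5.0.
```

An object whose reference value falls inside its interval would correspond to an insider in the OSP terminology above; how far the interval misses the reference value is what the SIC residual quantifies.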
