The Importance of Balanced Data Sets for Partial Least Squares Discriminant Analysis: Classification Problems Using Hyperspectral Imaging Data

This study investigates the effect of imbalanced spectral data in the training set when developing partial least squares discriminant analysis (PLS-DA) classification models for use in future predictions. The experimental study used a real hyperspectral short-wavelength infrared image data set collected from bakery products (buns) containing contaminants (flies); similar applications for other insects, paper and plastic were also tested. The contaminants represent a very small proportion of the image pixels relative to the bun. The PLS-DA model aims to detect and classify the contaminants accurately, and this requires a modification of the calibration data set. The paper deals with the problems caused by unbalanced calibration data sets and how to remedy them. The example demonstrates that balancing the calibration data, from 58,476 bun pixels + 279 fly pixels to 279 bun + 279 fly pixels, improved the number of true predictions while using fewer PLS components in the model. True predictions for flies improved from 65% with ten PLS components to > 99% with five to six PLS components. True predictions for bun went from 100% to 99.5% with six PLS components, which is an acceptable reduction. Theoretical explanations are included.
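
The balancing step described above can be illustrated with a minimal sketch. It assumes the 279 bun pixels are chosen by simple random undersampling, which the abstract does not state; the paper may have used a different selection method. The function name `balance_by_undersampling`, the synthetic spectra and the choice of 20 wavelength bands are hypothetical and serve only to reproduce the class counts quoted in the abstract.

```python
import numpy as np


def balance_by_undersampling(X, y, seed=0):
    """Randomly undersample every class to the size of the smallest class.

    Assumption: random selection; the source does not say how the
    retained majority-class pixels were picked.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # Draw n_min row indices per class without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original pixel order
    return X[keep], y[keep]


# Synthetic stand-in for an unfolded hyperspectral image:
# rows are pixels, columns are wavelength bands (values are random noise).
n_bun, n_fly, n_bands = 58476, 279, 20
rng = np.random.default_rng(1)
X = rng.normal(size=(n_bun + n_fly, n_bands))
y = np.array([0] * n_bun + [1] * n_fly)  # 0 = bun, 1 = fly

X_bal, y_bal = balance_by_undersampling(X, y)
print(np.bincount(y_bal))  # prints [279 279]
```

The balanced arrays `X_bal` and `y_bal` would then serve as the calibration set for a PLS-DA model, while the full image remains available for testing predictions.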
