Multi-Objective Genetic Algorithm-Based Sample Selection for Partial Least Squares Model Building with Applications to Near-Infrared Spectroscopic Data

In this study, multi-objective genetic algorithms (GAs) are applied to partial least squares (PLS) model building. The method aims to improve the performance and robustness of the PLS model by removing samples with systematic errors, including outliers, from the original data. The multi-objective GA optimizes the combination of samples to be removed. Both a training set and a validation set are used so that the multi-objective GA does not over-fit the training set, and reducing this over-fitting leads to accurate and robust PLS models. To clearly visualize the factors behind the systematic errors, an index defined from the original PLS model and a specific Pareto-optimal solution is also introduced. The method is applied to three kinds of near-infrared (NIR) spectra to build PLS models. The results demonstrate that the multi-objective GA significantly improves the performance of the PLS models. They also show that sample selection by the multi-objective GA enhances the ability of the PLS models to detect samples with systematic errors.
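
The core idea of scoring a candidate set of removed samples on two objectives and keeping only the Pareto-optimal candidates can be illustrated with a minimal sketch. Everything specific here is an assumption and not taken from the study: the two objectives are taken to be the training and validation root-mean-square errors, ordinary least squares stands in for PLS to keep the example dependency-free, and the function names `evaluate_mask` and `pareto_front` are hypothetical. The GA operators themselves (selection, crossover, mutation) are not shown.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def evaluate_mask(keep, X_train, y_train, X_val, y_val):
    """Score one candidate: fit on the kept training samples only.

    `keep` is a boolean array; False marks a sample removed from the
    training set. Returns (training RMSE, validation RMSE).
    Assumption: ordinary least squares replaces PLS in this sketch.
    """
    A = np.column_stack([np.ones(int(keep.sum())), X_train[keep]])
    coef, *_ = np.linalg.lstsq(A, y_train[keep], rcond=None)
    A_val = np.column_stack([np.ones(len(X_val)), X_val])
    return rmse(y_train[keep], A @ coef), rmse(y_val, A_val @ coef)


def pareto_front(objectives):
    """Indices of non-dominated rows when every column is minimized.

    Row q dominates row p if q is no worse in every objective and
    strictly better in at least one.
    """
    obj = np.asarray(objectives, dtype=float)
    front = []
    for i, p in enumerate(obj):
        no_worse = np.all(obj <= p, axis=1)
        strictly_better = np.any(obj < p, axis=1)
        if not np.any(no_worse & strictly_better):
            front.append(i)
    return front
```

In a full multi-objective GA, each individual in the population would be one such `keep` mask, and the non-dominated set returned by `pareto_front` would guide selection from one generation to the next.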
