A hybrid classification method: discrete canonical variate analysis using a genetic algorithm

Abstract: This paper describes a novel hybrid multivariate classification method, discrete canonical variate analysis (DCVA), which in the present implementation is integrated with a genetic algorithm (GA). DCVA transforms a multivariate data set into a set of discrete scores of lower dimensionality, intended specifically to act as classifiers that assign each observation to one of several pre-defined groups. The DCVA loadings are selected to maximize the ratio of the between-groups to within-groups variance of the scores but, unlike in conventional CVA, the relationship between the scores and loadings is non-linear and discontinuous. The performance of DCVA is compared with that of two competing classification methods, artificial neural networks (ANNs) and Mahalanobis distance-based linear discriminant analysis (LDA), using six example problems. In all cases, internal (leave-one-out) cross-validation was used, and classification success rates were retained from both the training and test segments. Of the methods studied, DCVA clearly performed best in training, producing the highest mean success rates for four of the six example data sets. For the test segments, DCVA produced the best performance for two of the data sets and equalled that of LDA and ANN for a third. However, LDA produced the best performance for the remaining three data sets. This suggests that DCVA, like other search-based methods, has a greater tendency to overfit.
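The selection criterion named in the abstract, the ratio of between-groups to within-groups variance of discrete scores, can be illustrated with a minimal sketch. The abstract does not say how DCVA discretizes its scores, so the rounding step below is purely an assumption, and the function names `discrete_scores` and `variance_ratio` and the toy data are hypothetical rather than taken from the paper.

```python
from collections import defaultdict


def discrete_scores(X, loadings):
    """Project each observation onto the loadings, then round to an integer.

    ASSUMPTION: rounding stands in for the paper's discretization step,
    which the abstract does not specify.
    """
    return [round(sum(x * w for x, w in zip(row, loadings))) for row in X]


def variance_ratio(scores, labels):
    """Between-groups sum of squares divided by within-groups sum of squares."""
    groups = defaultdict(list)
    for score, label in zip(scores, labels):
        groups[label].append(score)

    grand_mean = sum(scores) / len(scores)
    between = 0.0
    within = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)
        between += len(members) * (group_mean - grand_mean) ** 2
        within += sum((s - group_mean) ** 2 for s in members)

    if within == 0:
        # No within-group spread: perfect separation, or no separation at all.
        return float("inf") if between > 0 else 0.0
    return between / within


# Hypothetical toy data: four observations, two variables, two groups.
X = [[0.1, 1.0], [0.4, 1.2], [2.1, 0.9], [2.6, 1.1]]
labels = ["a", "a", "b", "b"]

ratio_first = variance_ratio(discrete_scores(X, [1, 0]), labels)   # 12.5
ratio_second = variance_ratio(discrete_scores(X, [0, 1]), labels)  # 0.0
```

Because rounding makes the ratio a step function of the loadings, small changes in the loadings either leave the criterion unchanged or shift it abruptly. This is one plausible reading of why a search procedure such as a GA is paired with the method instead of the closed-form solution available for conventional CVA.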
