Dimensionality reduction using genetic algorithms

Pattern recognition generally requires that objects be described in terms of a set of measurable features. The selection and quality of the features representing each pattern affect the success of subsequent classification. Feature extraction is the process of deriving new features from the original features in order to reduce the cost of feature measurement, increase classifier efficiency, and improve classification accuracy. Many feature extraction techniques apply a linear transformation to the original pattern vectors, producing new vectors of lower dimensionality. While this is useful for data visualization and classification efficiency, it does not necessarily reduce the number of features that must be measured, since each new feature may be a linear combination of all of the features in the original pattern vector. Here, we present a new approach in which feature selection, feature extraction, and classifier training are performed simultaneously using a genetic algorithm. The genetic algorithm optimizes a feature weight vector used to scale the individual features in the original pattern vectors. A masking vector is also employed to select a feature subset at the same time. We employ this technique in combination with the k-nearest-neighbor classification rule and compare the results with classical feature selection and extraction techniques, including sequential floating forward feature selection and linear discriminant analysis. We also present results for the identification of favorable water-binding sites on protein surfaces.
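
The idea of evolving a weight vector together with a masking vector can be illustrated with a minimal sketch. The abstract does not specify the chromosome encoding, the fitness function, or the genetic operators, so everything below is an assumption made for illustration: a real-valued chromosome holding weights followed by mask genes, a mask threshold of 0.5, a fitness equal to k-nearest-neighbor accuracy minus a small penalty per retained feature, truncation selection, uniform crossover, Gaussian mutation, and elitism. The function names (`knn_accuracy`, `fitness`, `evolve`) are hypothetical.

```python
import numpy as np

def knn_accuracy(X_train, y_train, X_test, y_test, weights, mask, k=3):
    """Accuracy of k-NN after scaling each feature by weights * mask."""
    scale = weights * mask                      # masked features get scale 0
    A, B = X_train * scale, X_test * scale
    # Squared Euclidean distance between every test and training point.
    d2 = ((B[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]     # indices of the k nearest
    votes = y_train[nearest]                    # integer class labels
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return float((pred == y_test).mean())

def fitness(chrom, n_features, data, k=3, penalty=0.01):
    """Assumed fitness: accuracy minus a penalty per selected feature."""
    weights = chrom[:n_features]
    mask = (chrom[n_features:] > 0.5).astype(float)
    if mask.sum() == 0:                         # no features selected
        return 0.0
    return knn_accuracy(*data, weights, mask, k) - penalty * mask.sum()

def evolve(data, n_features, pop_size=30, generations=40, seed=0):
    """Evolve weight and mask genes; return the best (weights, mask)."""
    rng = np.random.default_rng(seed)
    n_genes = 2 * n_features
    pop = rng.random((pop_size, n_genes))       # genes lie in [0, 1]
    for _ in range(generations):
        scores = np.array([fitness(c, n_features, data) for c in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]   # truncation selection
        ia = rng.integers(0, len(parents), pop_size - 1)
        ib = rng.integers(0, len(parents), pop_size - 1)
        pick = rng.random((pop_size - 1, n_genes)) < 0.5
        children = np.where(pick, parents[ia], parents[ib])  # uniform crossover
        mutate = rng.random(children.shape) < 0.1
        noise = rng.normal(0.0, 0.2, children.shape)
        children = np.clip(children + mutate * noise, 0.0, 1.0)
        pop = np.vstack([pop[order[0]], children])  # elitism: keep the best
    scores = np.array([fitness(c, n_features, data) for c in pop])
    best = pop[int(np.argmax(scores))]
    return best[:n_features], (best[n_features:] > 0.5).astype(int)
```

In use, `data` would be a tuple `(X_train, y_train, X_test, y_test)` with integer class labels, where the second pair is held-out tuning data rather than the final test set. Features whose mask bit is 0 drop out of the distance computation entirely, so they no longer need to be measured, which is the property that distinguishes this scheme from a general linear transformation.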
