One stop shopping: feature selection, classification and prediction in a single step

We report on the application of a genetic algorithm (GA) for pattern recognition that uses both supervised and transverse learning to mine spectroscopic and proteomic data. The pattern recognition GA selects features that optimize the separation of the classes in a plot of the two or three largest principal components of the data. For training sets with small amounts of labeled data (i.e., data points tagged with a class label) and large amounts of unlabeled data (i.e., data points that are not tagged with a class label), this approach is preferred because, as our results show, the fitness function uses information in the unlabeled data to guide feature selection. The advantages of incorporating transverse learning into the fitness function of the pattern recognition GA have been evaluated in two recently published studies by our group. In one study, Raman spectroscopy and the pattern recognition GA were used to develop a potential method to discriminate hardwoods, softwoods and tropical woods. In a second study, biopsy material of small round blue cell tumors analyzed by cDNA microarrays was identified as to type (Ewing's sarcoma, Burkitt's lymphoma, neuroblastoma and rhabdomyosarcoma) through supervised learning implemented by the pattern recognition GA. Copyright © 2011 John Wiley & Sons, Ltd.
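The idea of scoring a feature subset by class separation in a principal component plot can be illustrated with a minimal sketch. This is not the fitness function used in the studies; it is a simplified stand-in written under several assumptions: the function names (`pc_scores`, `separation_fitness`, `select_features`) are hypothetical, class separation is approximated by nearest-neighbor label agreement among the labeled points, and unlabeled samples (marked with label -1) contribute only by shaping the principal components. The GA operators (truncation selection, uniform crossover, bit-flip mutation) are generic choices, not those of the pattern recognition GA.

```python
import numpy as np

def pc_scores(X, n_components=2):
    """Project mean-centered rows of X onto the largest principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def separation_fitness(X, labels, mask, k=3):
    """Score a feature subset by class separation in the PC plot (assumed metric).

    labels: integer class per sample, -1 for unlabeled samples.
    mask:   boolean array selecting features.
    Returns the fraction of k nearest labeled neighbors sharing a point's class.
    """
    if mask.sum() < 2:
        return 0.0  # need at least two features to form a two-component plot
    # All samples, labeled and unlabeled, determine the principal components.
    scores = pc_scores(X[:, mask])
    labeled = labels >= 0
    S, y = scores[labeled], labels[labeled]
    d = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]
    return float((y[nn] == y[:, None]).mean())

def select_features(X, labels, pop_size=30, generations=40, seed=0):
    """Generic GA over boolean feature masks (illustrative operators only)."""
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, X.shape[1])) < 0.5
    for _ in range(generations):
        fit = np.array([separation_fitness(X, labels, m) for m in pop])
        parents = pop[np.argsort(fit)[::-1][: pop_size // 2]]  # keep the best half
        n_child = pop_size - len(parents)
        pa = parents[rng.integers(len(parents), size=n_child)]
        pb = parents[rng.integers(len(parents), size=n_child)]
        children = np.where(rng.random(pa.shape) < 0.5, pa, pb)  # uniform crossover
        children ^= rng.random(children.shape) < 0.02            # bit-flip mutation
        pop = np.vstack([parents, children])
    fit = np.array([separation_fitness(X, labels, m) for m in pop])
    return pop[np.argmax(fit)], float(fit.max())
```

Calling `select_features(X, labels)` on a samples-by-features array returns a boolean feature mask and its score. Because the principal components are computed from every row of `X`, the unlabeled samples influence which subsets score well even though only labeled points enter the neighbor count.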
