Feature selection versus feature compression in the building of calibration models from FTIR-spectrophotometry datasets.

FTIR spectrophotometry has become a standard tool in the chemical industry for on-the-fly monitoring of the concentrations of reagents and by-products. However, FTIR spectra are characterized by hundreds, if not thousands, of variables, so representing chemical samples by their spectra brings its own challenges: the samples must be analyzed in a high-dimensional feature space in which many features are likely to be highly correlated and many others are affected by noise. Identifying a subset of features that preserves the performance of the classifier or regressor is therefore an important step before any attempt to build a pattern recognition model. In this context, we investigate the benefit of two dimensionality reduction methods, namely the minimum Redundancy-Maximum Relevance (mRMR) feature selection scheme and a new feature compression method based on self-organizing maps (SOM), each coupled to regression methods, for the quantitative analysis of two-component liquid samples by FTIR spectrophotometry. Because these methods yield a small set of relevant features that preserves the statistical characteristics of the target variable, we argue that expressing FTIR spectra by such dimensionality-reduced feature sets may be beneficial. We demonstrate the utility of these schemes in quantifying the individual analytes in binary mixtures with an FTIR spectrophotometer.
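To make the selection step concrete, the following is a minimal sketch of greedy mRMR-style feature selection followed by a linear calibration. It is not the authors' implementation. It rests on several assumptions: absolute Pearson correlation stands in for the mutual information used in the original mRMR criterion, the spectra are synthetic (two Gaussian absorption bands plus noise), and the names `abs_corr`, `mrmr_select`, `band_a` and `band_b` are hypothetical. The reported error is an in-sample fit, not a validated prediction error.

```python
import numpy as np

def abs_corr(a, b):
    """Absolute Pearson correlation of two 1-D arrays (0 if either is constant)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return 0.0 if denom == 0 else abs(a @ b) / denom

def mrmr_select(X, y, k):
    """Greedily pick k columns of X: high relevance to y, low redundancy
    with the columns already picked (difference form of the criterion).
    ASSUMPTION: correlation is used as a proxy for mutual information."""
    n_features = X.shape[1]
    relevance = np.array([abs_corr(X[:, j], y) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    redundancy_sum = np.zeros(n_features)
    while len(selected) < k:
        last = selected[-1]
        # Accumulate redundancy of every feature with the newest selection.
        redundancy_sum += np.array(
            [abs_corr(X[:, j], X[:, last]) for j in range(n_features)]
        )
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf  # never pick the same feature twice
        selected.append(int(np.argmax(score)))
    return selected

# Synthetic binary-mixture "spectra" (illustrative data, not from the paper).
rng = np.random.default_rng(0)
axis = np.linspace(0.0, 1.0, 200)                # normalized spectral axis
band_a = np.exp(-((axis - 0.3) / 0.05) ** 2)     # hypothetical band of analyte A
band_b = np.exp(-((axis - 0.7) / 0.05) ** 2)     # hypothetical band of analyte B
conc_a = rng.uniform(0.0, 1.0, 60)               # fraction of A in each sample
conc_b = 1.0 - conc_a                            # fractions sum to one
X = np.outer(conc_a, band_a) + np.outer(conc_b, band_b)
X += rng.normal(0.0, 0.01, X.shape)              # additive measurement noise

selected = mrmr_select(X, conc_a, k=5)

# Ordinary least squares on the selected features plus an intercept.
A = np.column_stack([X[:, selected], np.ones(len(conc_a))])
coef, *_ = np.linalg.lstsq(A, conc_a, rcond=None)
rmse = float(np.sqrt(np.mean((A @ coef - conc_a) ** 2)))
print("selected feature indices:", selected)
print("in-sample RMSE:", rmse)
```

In practice the relevance and redundancy terms would be estimated with mutual information, and the calibration error would be assessed on held-out samples rather than on the training set.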
