Rough set based wavelength selection in near-infrared spectral analysis

Abstract A rough set based procedure is proposed as a new methodology for selecting component-specific wavelengths in near-infrared (NIR) spectral analysis. Information gain (IG) was employed to regulate the size of the discernibility matrix and thereby decrease the memory requirements of rough set based reduction. The resulting wavelength subsets were then submitted to the analytical models in question. The utility of the method was illustrated with classification models for phenylalanine (Phe) in plasma. The wavelength selection algorithm was compared with the correlation based feature selection (CRFS) method and the consistency based feature selection (CSFS) approach. Model fit was assessed using 10-fold cross-validation (10-fold CV) and the leave-one-out (LOO) approach, and the predictive ability of the model was evaluated on an external prediction set. In addition, two further NIR data sets, obtained from the published literature, were used to develop quantitative models and validate the rough set based wavelength selection method. The study demonstrates that reducts of rough sets can preserve the spectra–structure relationship and provide reliable model variables for NIR analysis. The results also indicate that the rough set algorithm may hold promise as an additional feasible technique for NIR band assignment. As a fast, simple and noninvasive measurement, NIR analysis may, with further research, find clinical use in the diagnosis of unusual Phe elevation.
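The idea of using information gain to limit the size of the discernibility matrix can be illustrated with a minimal sketch. It is not the authors' implementation: the toy table of discretized absorbances, the class labels, and the function names (`information_gain`, `select_by_ig`, `discernibility_matrix`) are all assumptions made for illustration, and the abstract does not state how the IG cut-off was chosen. The sketch ranks wavelengths by IG, keeps the top few, and builds the discernibility matrix only over those, so each matrix entry holds fewer attributes.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """Entropy of the labels minus their entropy conditioned on one attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_by_ig(table, labels, top_k):
    """Return the indices of the top_k attributes ranked by information gain."""
    n_attrs = len(table[0])
    gains = [information_gain([row[j] for row in table], labels)
             for j in range(n_attrs)]
    ranked = sorted(range(n_attrs), key=lambda j: gains[j], reverse=True)
    return sorted(ranked[:top_k])


def discernibility_matrix(table, labels, attrs):
    """For each pair of samples with different labels, record the attributes
    (restricted to attrs) on which the two samples differ."""
    entries = {}
    for i, k in combinations(range(len(table)), 2):
        if labels[i] != labels[k]:
            entries[(i, k)] = {j for j in attrs if table[i][j] != table[k][j]}
    return entries


# Hypothetical toy data: 6 samples, 4 discretized wavelength attributes.
table = [
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
labels = ["normal", "normal", "normal", "high", "high", "high"]

kept = select_by_ig(table, labels, top_k=1)
full = discernibility_matrix(table, labels, attrs=range(4))
reduced = discernibility_matrix(table, labels, attrs=kept)

# Total number of attribute references stored in each matrix.
full_size = sum(len(s) for s in full.values())
reduced_size = sum(len(s) for s in reduced.values())
```

In this toy case the first attribute separates the two classes on its own, so the reduced matrix still discerns every pair of differently labelled samples while storing fewer attribute references. On real spectra the filter may discard wavelengths that a reduct would need, which is why the cut-off has to be chosen with care.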
