Artificially generated near-infrared spectral data for classification purposes

Abstract: Near-infrared spectroscopy has become a widely used analytical technique in many research fields because it is non-destructive and low cost. The spectra are rich in information but extremely complex, so their analysis requires advanced statistical methods. The empirical properties of these methods can be assessed using artificially generated data that resemble real near-infrared spectra. In this paper we propose a new data generation approach (ABS) that takes into account theoretical knowledge about the near-infrared absorption of functional groups. The proposed method is compared to real data and to a simpler data generation method, which simulates the data from a multivariate normal distribution whose parameters are estimated from real data (MVNorig). The comparison between real data and the two data generation approaches is based on a class-imbalanced classification problem using linear discriminant analysis, classification trees and support vector machines. Both simulation approaches generated spectra that closely resembled real data, with MVNorig performing slightly better than ABS; real and simulated data would have led to similar conclusions about the class-imbalance problem in classification. Both methods can therefore be used to artificially generate near-infrared spectra. The method based on the multivariate normal distribution can be used when a large number of real spectra is available, while the appropriateness of the results of the ABS method depends on how exactly the near-infrared absorption of the functional groups is known.
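The MVNorig idea described in the abstract (estimate the parameters of a multivariate normal distribution from real spectra, then draw new spectra from it) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy "real" spectra, the five-wavelength grid and the small ridge term added to the covariance diagonal are all assumptions made for illustration only.

```python
import math
import random

def estimate_mean_cov(spectra, ridge=1e-8):
    """Sample mean vector and covariance matrix of the rows of `spectra`."""
    n, p = len(spectra), len(spectra[0])
    mean = [sum(row[j] for row in spectra) / n for j in range(p)]
    cov = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in spectra) / (n - 1)
            for j in range(p)] for i in range(p)]
    for j in range(p):
        cov[j][j] += ridge  # assumption: tiny ridge keeps the matrix positive definite
    return mean, cov

def cholesky(a):
    """Lower-triangular factor L with L * L^T = a (a must be positive definite)."""
    p = len(a)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def simulate_spectra(mean, cov, n_new, rng):
    """Draw n_new spectra from N(mean, cov) as mean + L z, with z standard normal."""
    L = cholesky(cov)
    p = len(mean)
    out = []
    for _ in range(n_new):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        out.append([mean[i] + sum(L[i][k] * z[k] for k in range(i + 1))
                    for i in range(p)])
    return out

# Hypothetical stand-in for real spectra: 40 samples at 5 wavelengths, each with
# a shared baseline shift plus small independent noise.
rng = random.Random(0)
toy_real = []
for _ in range(40):
    shift = rng.gauss(0.0, 0.05)
    toy_real.append([0.5 + 0.1 * j + shift + rng.gauss(0.0, 0.01) for j in range(5)])

mean, cov = estimate_mean_cov(toy_real)
simulated = simulate_spectra(mean, cov, 200, rng)
```

In a classification setting one would presumably estimate a separate mean and covariance for each class and draw the desired number of spectra per class, which also makes it straightforward to control the degree of class imbalance. Real spectra usually have far more wavelengths than samples, so the sample covariance is singular and needs stronger regularization than the tiny ridge used here; this is consistent with the abstract's remark that the approach suits situations where many real spectra are available.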
