Optimizing wavelength selection by using informative vectors for parsimonious infrared spectra modelling

Abstract Infrared spectroscopy has been widely adopted in agricultural research. A typical spectrum contains thousands of wavelengths. This large number of spectral variables often contributes collinearity and redundancy rather than relevant information. Selecting variables among the predictors is therefore an important step in building a robust calibration model from spectral data. This paper presents an algorithm for spectral variable selection based on a combination of informative vectors and an ordered predictor selection (OPS) approach with an exponentially decreasing function (EDF) selection. Informative vectors are features derived from statistical principles that describe the relationship between the dependent variables and the predictors (spectra). The informative vectors analysed are the regression coefficient vector (b), variable influence on projection (V), residual vector (S), net analyte signal vector (Na), linear correlation vector (COR), biweight mid-correlation vector (BIC), mutual information based on adjacency matrix (AMI), and covariance procedures matrix (COV). These eight informative vectors can be joined in pairs to form 22 combination vectors. The approach was tested on near-infrared soil spectra for predicting pH, clay and sand content, cation exchange capacity (CEC), and total carbon content. Cubist regression tree and partial least squares regression (PLSR) models were used for calibration. Using only a subset of the spectra (retaining the wavelengths that are significant based on the absolute values of the informative vectors), the regression models were still able to enhance the prediction capability. Overall, the PLSR model performed better than the Cubist model. The informative vectors b and S (and their combinations) provided the most accurate predictions for this dataset. Although the subset models did not perform better than the full-spectrum model, the number of wavelength variables used in the model was reduced substantially, to 25% on average.
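
The idea of ranking wavelengths by the absolute value of an informative vector and keeping only the top-ranked ones can be illustrated with a minimal sketch. This is not the paper's implementation: the data are synthetic, only two of the eight vectors are computed (COR, and a plain least-squares stand-in for b rather than PLSR coefficients), the pairwise combination is assumed to be an element-wise product of the normalised vectors, and a fixed 25% retention stands in for the EDF selection. All function names are hypothetical.

```python
import numpy as np

def correlation_vector(X, y):
    """COR: absolute Pearson correlation of each wavelength (column) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc / denom)

def regression_vector(X, y):
    """Stand-in for b: absolute least-squares coefficients on centred data.
    (Assumption: the paper derives b from a PLSR model, not from plain OLS.)"""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    b, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    return np.abs(b)

def select_wavelengths(vector, fraction=0.25):
    """Sort wavelengths by decreasing vector value and keep the top fraction.
    (Assumption: a fixed fraction replaces the EDF-based selection.)"""
    n_keep = max(1, int(round(fraction * vector.size)))
    order = np.argsort(vector)[::-1]
    return np.sort(order[:n_keep])

# Synthetic "spectra": 300 samples, 40 wavelengths, 3 of them informative.
rng = np.random.default_rng(0)
n_samples, n_wavelengths = 300, 40
X = rng.normal(size=(n_samples, n_wavelengths))
informative = np.array([5, 6, 30])
y = X[:, informative].sum(axis=1) + 0.1 * rng.normal(size=n_samples)

cor = correlation_vector(X, y)
b = regression_vector(X, y)

# Assumed combination rule: product of the two vectors scaled to [0, 1].
combined = (cor / cor.max()) * (b / b.max())

selected = select_wavelengths(combined, fraction=0.25)
print("retained wavelength indices:", selected)
```

Real spectra are strongly collinear and usually have far more wavelengths than samples, so the least-squares stand-in would need to be replaced by PLSR coefficients in practice; the sketch only shows the sort-and-retain step.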
