A combination strategy of random forest and back propagation network for variable selection in spectral calibration

Abstract Random forest (RF) and neural networks have received significant interest in statistical data analysis because of their good predictive performance and attractive analytical properties. When developing an RF regression model for spectral analysis, informative wavelengths should be selected to reduce dimensionality effectively and improve interpretability, whereas a neural network has the merit of recovering strong signals in the data. This paper proposes a chemometric strategy implemented through the combined use of the RF algorithm and a back propagation (BP) network. The RF-selected informative wavelengths were further refined by a moderate 3-layer BP network, in which the number of hidden nodes was tunable and was finally determined by searching for the minimum output error. The BP network was trained in combination with RF to generate a new comprehensive variable, producing a renewed informative-plus-net variable group. This renewed group of variables was used in a multiple linear regression model to evaluate the ability of spectral analysis to quantitatively determine the content of the target analyte. The application case was based on a Fourier transform near-infrared dataset of soil samples, with the aim of chemometrically determining the content of nutritional organic carbon. The prediction results indicated that the proposed strategy of combining RF and a BP network can improve prediction accuracy and enhance model interpretability compared with the general RF method and the conventional benchmark, partial least squares regression. The methodology presented here is of practical significance and has wide potential application in rapid nutrient determination in the development of precision agriculture.
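The hidden-node search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic data, the tanh activation, the learning rate, the train/validation split, the candidate range of 1 to 6 hidden nodes, and the function names `train_bp`, `mse`, and `select_hidden_nodes` are all assumptions made for illustration. The three input columns stand in for a handful of RF-selected wavelengths.

```python
import math
import random


def train_bp(X, y, n_hidden, epochs=300, lr=0.05, seed=0):
    """Train a 3-layer (input, hidden, output) network by stochastic
    backpropagation and return a prediction function.
    Activation, epochs and learning rate are illustrative assumptions."""
    rng = random.Random(seed)
    n_in = len(X[0])
    # Each row of W1 holds the weights into one hidden node; last entry is the bias.
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    # W2 holds the hidden-to-output weights; last entry is the output bias.
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(x):
        h = [math.tanh(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in W1]
        out = sum(w * v for w, v in zip(W2, h)) + W2[-1]
        return h, out

    for _ in range(epochs):
        for x, t in zip(X, y):
            h, out = forward(x)
            err = out - t  # gradient of 0.5 * (out - t)^2 w.r.t. the output
            # Hidden-layer deltas, computed before the output weights change.
            deltas = [err * W2[j] * (1.0 - h[j] ** 2) for j in range(n_hidden)]
            for j in range(n_hidden):
                W2[j] -= lr * err * h[j]
            W2[-1] -= lr * err
            for j in range(n_hidden):
                for i in range(n_in):
                    W1[j][i] -= lr * deltas[j] * x[i]
                W1[j][-1] -= lr * deltas[j]

    return lambda x: forward(x)[1]


def mse(predict, X, y):
    """Mean squared error of a prediction function on (X, y)."""
    return sum((predict(x) - t) ** 2 for x, t in zip(X, y)) / len(y)


def select_hidden_nodes(X_tr, y_tr, X_val, y_val, candidates):
    """Train one network per candidate size and keep the size with the
    smallest validation error."""
    errors = {h: mse(train_bp(X_tr, y_tr, h), X_val, y_val) for h in candidates}
    best = min(errors, key=errors.get)
    return best, errors


# Synthetic stand-in for absorbances at three selected wavelengths (assumption).
data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(60)]
y = [math.sin(2 * a) + 0.5 * b * c + data_rng.gauss(0, 0.05) for a, b, c in X]
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

best_h, errors = select_hidden_nodes(X_tr, y_tr, X_val, y_val, range(1, 7))
print("hidden nodes with minimum validation error:", best_h)
```

In a workflow like the one the abstract outlines, the output of the selected network would then serve as an extra variable alongside the RF-selected wavelengths in the final linear regression. That step is omitted here to keep the sketch short.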
