Random forest–based estimation of heavy metal concentration in agricultural soils with hyperspectral sensor data

Heavy metals in the agricultural soils of reclaimed mining areas can contaminate food and endanger human health. This study aims to estimate the concentrations of four heavy metals, zinc (Zn), chromium (Cr), arsenic (As), and lead (Pb), from hyperspectral sensor data using the random forest (RF) algorithm, with Xuzhou, China, as the study area. The built-in feature selection ability of RF and its modeling capacity for soil heavy metal estimation were explored. Spectra acquired with an ASD (Analytical Spectral Devices) field spectrometer were preprocessed, and RF estimation models were then built on two feature sets: correlation-selected features and full-spectrum features. The results of all processing variants were compared with those of classical approaches, such as partial least squares (PLS) regression and support vector machine (SVM). Across all experiments, the best estimation model for Zn (R² = 0.9061; RMSE = 6.5008) was based on full-spectrum data with continuum removal (CR) pretreatment, whereas the best models for Cr (R² = 0.9110; RMSE = 4.5683), As (R² = 0.9912; RMSE = 0.5327), and Pb (R² = 0.9756; RMSE = 1.1694) were all derived from correlation-selected features. All four best models were built with the RF method. The experiments show that RF can make full use of the input spectral data when estimating the four heavy metals, and that the resulting models outperform those built with the traditional methods.
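The comparison described in the abstract, RF regression on the full spectrum versus on correlation-selected bands, scored by R² and RMSE, can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the scikit-learn `RandomForestRegressor` stands in for whatever RF software the study used, and the helper names (`select_by_correlation`, `fit_and_score`), the number of trees, the number of retained bands, and the train/test split are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def select_by_correlation(X, y, top_k):
    """Return indices of the top_k bands with the largest |Pearson r| against y.

    Hypothetical helper: the abstract does not state the selection threshold.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = (Xc * yc[:, None]).sum(axis=0) / denom
    return np.argsort(-np.abs(r))[:top_k]


def fit_and_score(X_train, X_test, y_train, y_test, seed=0):
    """Fit an RF regressor and return (R^2, RMSE) on the held-out samples."""
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    return float(r2_score(y_test, pred)), rmse


# Synthetic stand-in for preprocessed spectra: 120 samples x 300 bands.
# Only bands 40 and 180 carry signal; real soil spectra are far more correlated.
rng = np.random.default_rng(0)
n_samples, n_bands = 120, 300
X = rng.normal(size=(n_samples, n_bands))
y = 50 + 8 * X[:, 40] - 5 * X[:, 180] + rng.normal(scale=1.0, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Select bands on the training split only, to avoid leaking test information.
bands = select_by_correlation(X_train, y_train, top_k=20)

results = {
    "full spectrum": fit_and_score(X_train, X_test, y_train, y_test),
    "correlation-selected": fit_and_score(
        X_train[:, bands], X_test[:, bands], y_train, y_test
    ),
}
for name, (r2, rmse) in results.items():
    print(f"{name}: R2 = {r2:.4f}, RMSE = {rmse:.4f}")
```

Selecting bands on the training split alone keeps the held-out scores honest; choosing them on all samples would tend to inflate R². The numbers printed here describe the synthetic data only and say nothing about the values reported in the abstract.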
