Neglecting spatial autocorrelation causes underestimation of the error of sugarcane yield models

Abstract With the increased application of information technology in agriculture, data is being produced and used on an unprecedented scale. While these advances, combined with machine learning techniques, have benefited yield modeling, most of the current literature on data-driven yield modeling has not yet accounted for potential sources of correlation in the data and instead assumes independence between samples. Under that assumption, random sampling can place correlated samples in different sets used for model evaluation. We implemented a spatially-aware protocol and compared it with the naive approach of assuming independence between samples. The protocols were applied throughout the model development pipeline: data splitting for hold-out sets, feature selection, cross-validation for model adjustment, and model evaluation. Three different machine learning techniques were used to create models in each protocol. The resulting models were evaluated both on the validation set created by each protocol and on a manually created independent set. This independent set ensured there was no autocorrelation between the samples used for modeling. We showed that assuming independence when modeling yield leads to underestimated model errors and to overfitting during model adjustment. Even with better error tracking, the model with the smallest error on the test set was not the model with the smallest validation error, suggesting overfitting during model selection. This effect was small for the spatially-aware protocol and much stronger for the naive protocol. Future efforts in yield modeling should address the effect of spatial autocorrelation and other potential sources of correlation to improve the correctness and robustness of the results.
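The contrast between the naive and the spatially-aware splitting can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how spatial groups were formed, so the square grid blocks, the `block_size` parameter, and all function names below are assumptions made purely for illustration. The idea shown is only that a spatially-aware split keeps every spatial block entirely in either the training or the test side of a fold, whereas a random split lets neighboring samples fall on both sides.

```python
import random
from collections import defaultdict


def block_id(x, y, block_size):
    """Map a coordinate to the grid cell (hypothetical spatial block) containing it."""
    return (int(x // block_size), int(y // block_size))


def spatial_block_folds(coords, n_folds, block_size, seed=0):
    """Yield (train, test) index lists; no spatial block is split across the two."""
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        blocks[block_id(x, y, block_size)].append(i)
    keys = sorted(blocks)
    random.Random(seed).shuffle(keys)  # assign whole blocks to folds at random
    for k in range(n_folds):
        test_keys = set(keys[k::n_folds])
        test = sorted(i for key in test_keys for i in blocks[key])
        train = sorted(i for key in keys if key not in test_keys
                       for i in blocks[key])
        yield train, test


def random_folds(n_samples, n_folds, seed=0):
    """Naive protocol: shuffle individual samples, ignoring their location."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    for k in range(n_folds):
        test = sorted(idx[k::n_folds])
        test_set = set(test)
        train = [i for i in range(n_samples) if i not in test_set]
        yield train, test


def shared_blocks(coords, train, test, block_size):
    """Count test samples whose spatial block also contains training samples."""
    train_blocks = {block_id(*coords[i], block_size) for i in train}
    return sum(block_id(*coords[i], block_size) in train_blocks for i in test)


# Synthetic field: a 20 x 20 grid of sample locations (made-up data).
coords = [(x, y) for x in range(20) for y in range(20)]

for name, folds in [("spatial", spatial_block_folds(coords, 4, 5)),
                    ("random", random_folds(len(coords), 4))]:
    leaks = [shared_blocks(coords, tr, te, 5) for tr, te in folds]
    print(name, "test samples sharing a block with training data:", leaks)
```

In a real pipeline the same fold definition would be reused for feature selection, hyperparameter tuning, and final evaluation, so that no step sees spatial neighbors of the held-out samples.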
