A statistical learning framework for groundwater nitrate models of the Central Valley, California, USA

Summary: We used a statistical learning framework to evaluate how well three machine-learning methods predict nitrate concentration in shallow groundwater of the Central Valley, California: boosted regression trees (BRT), artificial neural networks (ANN), and Bayesian networks (BN). Machine-learning methods can learn complex patterns in data but, because of overfitting, may not generalize well to new data. The framework combines cross-validation (CV) training and testing data with a separate hold-out data set for model evaluation, with the goal of optimizing predictive performance by controlling model overfit. Ranked by both CV testing R² and hold-out R², prediction performance was BRT > BN > ANN. For each method we identified two models from the CV testing results: the model with maximum testing R², and a model with testing R² within one standard error of the maximum (the 1SE model). The maximum R² models yielded CV training R² values of 0.94–1.0. CV testing R² values, which indicate predictive performance, were 0.22–0.39 for the maximum R² models and 0.19–0.36 for the 1SE models. Evaluation with hold-out data suggested that the 1SE BRT and ANN models predicted an independent data set better than their maximum R² counterparts, which is relevant to extrapolation by mapping. Scatterplots of predicted vs. observed hold-out data for the final models helped identify prediction bias, which was fairly pronounced for ANN and BN. Lastly, the models were compared with multiple linear regression (MLR) and a previous random forest regression (RFR) model. Whereas BRT results were comparable to RFR, MLR had a low hold-out R² (0.07) and explained less than half the variation in the training data. Spatial patterns of predictions by the final 1SE BRT model agreed reasonably well with previously observed patterns of nitrate occurrence in groundwater of the Central Valley.
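The selection of a maximum R² model and a 1SE model can be illustrated with a minimal Python sketch. It is not the authors' code. The function name `select_models`, the integer complexity index, and all per-fold R² values are hypothetical, and the sketch assumes that candidate models can be ordered by complexity and that the 1SE model is the least complex candidate whose mean CV testing R² lies within one standard error of the maximum.

```python
from math import sqrt
from statistics import mean, stdev


def select_models(cv_scores):
    """Pick the maximum-R2 model and the 1SE model from CV testing results.

    cv_scores: list of (complexity, [per-fold testing R2]) tuples, where a
    smaller complexity value means a simpler model (an assumption of this
    sketch, not something stated in the summary).
    Returns two (complexity, mean_r2, standard_error) tuples.
    """
    summary = []
    for complexity, folds in cv_scores:
        # Standard error of the mean testing R2 across CV folds.
        se = stdev(folds) / sqrt(len(folds))
        summary.append((complexity, mean(folds), se))

    # Model with the highest mean CV testing R2.
    best = max(summary, key=lambda row: row[1])

    # Candidates whose mean R2 is within one standard error of the maximum.
    threshold = best[1] - best[2]
    eligible = [row for row in summary if row[1] >= threshold]

    # The 1SE model: the simplest eligible candidate.
    one_se = min(eligible, key=lambda row: row[0])
    return best, one_se


# Hypothetical per-fold testing R2 values for four candidate models.
candidates = [
    (1, [0.20, 0.24, 0.22, 0.18, 0.26]),
    (2, [0.31, 0.37, 0.34, 0.30, 0.38]),
    (3, [0.32, 0.38, 0.35, 0.30, 0.40]),
    (4, [0.31, 0.37, 0.34, 0.29, 0.39]),
]

best, one_se = select_models(candidates)
print("maximum R2 model:", best)
print("1SE model:", one_se)
```

With these made-up scores, the 1SE rule picks a simpler candidate than the maximum R² model at a small cost in CV testing R². That is the trade-off the hold-out evaluation is meant to check.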
