Identifying the origin of groundwater samples in a multi-layer aquifer system with Random Forest classification

Summary. Accurate identification of the origin of groundwater samples is not always possible in complex multi-layer aquifers, which makes it difficult to interpret geochemical results reliably. The problem is especially severe when information on tubewell design is hard to obtain. This paper presents a supervised classification method based on the Random Forest (RF) machine learning technique to identify the layer from which groundwater samples were extracted. The classification rules were based on the major ion composition of the samples. We applied this method to the Campo de Cartagena multi-layer aquifer system in southeastern Spain. A large amount of hydrogeochemical data was available, but only a limited fraction of the sampled tubewells had a reliable record of the borehole design and, consequently, of the aquifer layer being exploited. An added difficulty was the very similar composition of water samples extracted from different aquifer layers. Moreover, not all groundwater samples included the same geochemical variables. Despite these difficulties, the Random Forest classification reached accuracies over 90%. These results were much better than those of the Linear Discriminant Analysis (LDA) and Decision Trees (CART) supervised classification methods. Of a total of 1549 samples, 805 came from a single identified aquifer, 409 came from a possible blend of waters from several aquifers, and 335 were of unknown origin. Only 468 of the 805 single-aquifer samples included all the chemical variables needed to calibrate and validate the models. The uncertainty in the identification of training samples was taken into account to enhance the model. Finally, 107 of the groundwater samples of unknown origin could be classified. Most of the samples that could not be classified had an incomplete dataset.
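The workflow outlined in the summary (keep only complete samples, compare RF against LDA and CART, then classify samples of unknown origin) can be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: it uses scikit-learn, the ion list `IONS`, the three-layer setup, and all concentrations are made-up assumptions, and the helper names are hypothetical. The synthetic data are Gaussian, so the scores it produces say nothing about how the methods rank on real hydrochemical data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed major-ion variables; the summary does not list the ones actually used.
IONS = ["Ca", "Mg", "Na", "K", "Cl", "SO4", "HCO3"]


def make_synthetic_samples(n_per_layer=150, n_layers=3, seed=0):
    """Generate made-up ion concentrations with overlapping layer signatures."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for layer in range(n_layers):
        centre = 5.0 + 0.6 * layer  # layers differ only slightly
        X_parts.append(rng.normal(centre, 1.0, size=(n_per_layer, len(IONS))))
        y_parts.append(np.full(n_per_layer, layer))
    X = np.vstack(X_parts)
    y = np.concatenate(y_parts)
    # Blank out about 5% of values to mimic samples lacking some variables.
    X[rng.random(X.shape) < 0.05] = np.nan
    return X, y


def complete_cases(X, y):
    """Keep only samples in which every variable was measured."""
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]


def compare_classifiers(X, y, seed=0):
    """Mean 5-fold cross-validated accuracy for RF, LDA and CART."""
    models = {
        "RF": RandomForestClassifier(n_estimators=200, random_state=seed),
        "LDA": LinearDiscriminantAnalysis(),
        "CART": DecisionTreeClassifier(random_state=seed),
    }
    return {name: cross_val_score(model, X, y, cv=5).mean()
            for name, model in models.items()}


def classify_unknown(model, X_unknown):
    """Predict a layer for complete rows; incomplete rows stay unclassified (-1)."""
    labels = np.full(len(X_unknown), -1)
    complete = ~np.isnan(X_unknown).any(axis=1)
    if complete.any():
        labels[complete] = model.predict(X_unknown[complete])
    return labels


# Samples whose layer is known (synthetic stand-ins for the training set).
X_all, y_all = make_synthetic_samples(seed=0)
X_train, y_train = complete_cases(X_all, y_all)
scores = compare_classifiers(X_train, y_train)

# Samples of unknown origin: the true labels are discarded on purpose.
X_unknown, _ = make_synthetic_samples(n_per_layer=40, seed=1)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
labels = classify_unknown(rf, X_unknown)

for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.2f}")
print("unclassified (incomplete) samples:", int((labels == -1).sum()))
```

Dropping incomplete rows is one simple way to handle samples that lack some variables; it mirrors the observation that most unclassified samples had an incomplete dataset, but whether the study handled missing variables exactly this way is an assumption here.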
