Evaluating influences of seasonal variations and anthropogenic activities on alluvial groundwater hydrochemistry using ensemble learning approaches

The chemical composition and hydrochemistry of groundwater are influenced by seasonal variations and anthropogenic activities in a region. Understanding these influences and the factors responsible for them is vital for effective groundwater management. In this study, ensemble learning based classification and regression models were constructed and applied to groundwater hydrochemistry data from the Unnao and Ghaziabad regions of northern India. Three tree-based models were built: single decision tree (SDT), decision tree forest (DTF), and decision treeboost (DTB). The predictive and generalization abilities of the proposed models were investigated using several statistical parameters and compared with those of the support vector machines (SVM) method. The DT and SVM models discriminated the groundwater of shallow and deep aquifers, industrial and non-industrial areas, and pre- and post-monsoon seasons, with misclassification rates (MR) of 1.52–14.92% (SDT), 0.91–6.52% (DTF), 0.61–5.27% (DTB), and 1.52–11.69% (SVM). For the complete data array of Ghaziabad, the respective regression models yielded a correlation between measured and predicted values of chemical oxygen demand (COD) and a root mean squared error of 0.874, 0.66 (SDT); 0.952, 0.48 (DTF); 0.943, 0.52 (DTB); and 0.785, 0.85 (SVR). The DTF and DTB models outperformed the SVM in both classification and regression. Incorporating the bagging and stochastic gradient boosting algorithms in the DTF and DTB models, respectively, enhanced their predictive ability. The proposed ensemble models successfully delineated the influences of seasonal variations and anthropogenic activities on groundwater hydrochemistry and can be used as effective tools for forecasting the chemical composition of groundwater for its management.
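The three performance measures quoted above (misclassification rate, correlation between measured and predicted values, and root mean squared error) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and all sample values are hypothetical, and the use of the Pearson coefficient for the "correlation" is an assumption, since the abstract does not specify which coefficient was used.

```python
import math


def misclassification_rate(actual, predicted):
    """Percentage of samples whose predicted class differs from the actual class."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return 100.0 * wrong / len(actual)


def rmse(measured, predicted):
    """Root mean squared error between measured and predicted values."""
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    return math.sqrt(squared / len(measured))


def pearson_r(measured, predicted):
    """Pearson correlation coefficient (assumed here; the abstract only says 'correlation')."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("inputs must have equal length and at least two values")
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_m * var_p)


# Hypothetical season labels for eight samples (not data from the study).
actual_season = ["pre", "pre", "post", "post", "pre", "post", "pre", "post"]
predicted_season = ["pre", "pre", "post", "pre", "pre", "post", "pre", "post"]

# Hypothetical measured and predicted COD values (not data from the study).
measured_cod = [10.0, 12.0, 14.0, 16.0]
predicted_cod = [11.0, 11.0, 15.0, 15.0]

mr = misclassification_rate(actual_season, predicted_season)
error = rmse(measured_cod, predicted_cod)
r = pearson_r(measured_cod, predicted_cod)

print(f"MR = {mr:.2f}%  r = {r:.3f}  RMSE = {error:.2f}")
```

Under these definitions a lower MR and RMSE and a higher correlation indicate a better model, which is the sense in which the DTF and DTB results above are read as outperforming SVM.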
