A tree-based statistical classification algorithm (CHAID) for identifying variables responsible for the occurrence of faecal indicator bacteria during waterworks operations

Microbial contamination of groundwater used for drinking water can affect public health and is of major concern to local water authorities and water suppliers. Potential hazards need to be identified in order to protect raw water resources. We propose a non-parametric data mining technique for exploring the presence of total coliforms (TC) in a groundwater abstraction well and its relationship to readily available, continuous time series of hydrometric monitoring parameters (seven-year records of precipitation, river water levels, and groundwater heads). The original monitoring parameters were used to create an extensive generic dataset of explanatory variables by applying different accumulation or averaging periods, as well as different temporal offsets, to each parameter. A classification tree based on the Chi-Squared Automatic Interaction Detection (CHAID) recursive partitioning algorithm revealed statistically significant relationships between precipitation and the presence of TC in both a production well and a nearby monitoring well. Different secondary explanatory variables were identified for the two wells. In the monitoring well, the presence of TC was associated with elevated water levels and short-term water level fluctuations in the nearby river. In the production well, the presence of TC was related to elevated groundwater heads and fluctuations in groundwater levels. The generic variables proved useful for increasing the statistical significance of the identified relationships. The tree-based model was used to predict the occurrence of TC on the basis of hydrometric variables.
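The construction of generic explanatory variables and the chi-squared criterion that CHAID relies on can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`window_features`, `chi_squared_2x2`), the window lengths, the offsets, and the precipitation values are all hypothetical, and the second function computes only a plain Pearson chi-squared statistic for a single binary split, without the category merging or Bonferroni adjustment that a full CHAID procedure involves.

```python
from itertools import product


def window_features(series, windows, offsets, agg="sum"):
    """Aggregate `series` over a trailing window of `w` steps that ends
    `k` steps before each time step (hypothetical naming scheme)."""
    features = {}
    for w, k in product(windows, offsets):
        values = []
        for t in range(len(series)):
            end = t - k + 1      # exclusive end of the window
            start = end - w      # inclusive start of the window
            if start < 0:
                values.append(None)  # not enough history yet
                continue
            total = sum(series[start:end])
            values.append(total if agg == "sum" else total / w)
        features[f"{agg}_w{w}_lag{k}"] = values
    return features


def chi_squared_2x2(flags, outcomes):
    """Pearson chi-squared statistic for two binary sequences, e.g.
    'variable above threshold' versus 'TC present'."""
    counts = [[0, 0], [0, 0]]
    for f, y in zip(flags, outcomes):
        counts[int(f)][int(y)] += 1
    n = sum(map(sum, counts))
    if n == 0:
        return 0.0
    stat = 0.0
    for i in range(2):
        for j in range(2):
            row = sum(counts[i])
            col = counts[0][j] + counts[1][j]
            expected = row * col / n
            if expected > 0:
                stat += (counts[i][j] - expected) ** 2 / expected
    return stat


# Made-up daily precipitation values (mm), for illustration only.
precip = [0, 0, 12, 3, 0, 0, 8, 0]
features = window_features(precip, windows=[1, 3], offsets=[0, 1])
```

Each key in `features` corresponds to one candidate variable, for example `sum_w3_lag1` for a three-step accumulation that ends one step before the sampling day. A tree-building routine would then compare such candidates by their chi-squared statistics against the observed TC presence and split on the most significant one.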
