Applying Tree Ensemble to Detect Anomalies in Real-World Water Composition Dataset

Drinking water is one of the fundamental human needs. During delivery through the distribution network, drinking water is susceptible to contamination. Early recognition of changes in water quality is essential to the provision of clean and safe drinking water. For this purpose, contamination warning systems (CWS), composed of sensors, a central database, and an event detection system (EDS), have been developed. Conventionally, an EDS employs time series analysis and domain knowledge for automated detection. This paper proposes a general data-driven approach to constructing an automated online event detection system for drinking water. Various tree ensemble models are investigated in application to real-world water quality data. In particular, gradient boosting methods are shown to overcome the challenges posed by time series data, class imbalance, and collinearity, and to yield satisfactory predictive performance.
