Machine learning approaches for anomaly detection of water quality on a real-world data set

ABSTRACT Accurate detection of water quality changes is a crucial task for water supply companies, which must provide safe drinking water. Nowadays, sensitive sensors monitor data over time in many different areas. Normally, the data recorded by the sensors carry meaning, for example indicating whether an event has occurred. Sometimes, however, the data are poorly understood, and deciding whether an event has occurred is difficult. This work describes several approaches to identifying changes or anomalies in water quality time series data. It also discusses some challenges that arise when dealing with time series data and proposes solutions to them. The following models are applied to water quality data: logistic regression, linear discriminant analysis, support vector machines (SVM), artificial neural network (ANN), deep neural network (DNN), recurrent neural network (RNN) and long short-term memory (LSTM). A simulation study is conducted to check the performance of each algorithm, with the F-score as the evaluation metric. Handling imbalanced data essentially means intentionally biasing the data to obtain interesting results instead of accurate ones. The results show that all algorithms are vulnerable, although SVM, ANN and logistic regression tend to be a little less vulnerable, while DNN, RNN and LSTM are very vulnerable.
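To illustrate why the F-score rather than plain accuracy is a natural metric for rare-event data such as this, the following minimal sketch computes both on a small synthetic label set. The labels, the class ratio (10 events in 1000 samples) and the two predictors are hypothetical and chosen only for illustration; they are not taken from the water quality data set or from the study's results.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives (event = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f_score(y_true, y_pred, beta=1.0):
    """F-beta score of the positive (event) class; beta=1 gives the F1 score."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0  # no event detected: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of samples labelled correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical imbalanced labels: 10 events followed by 990 normal samples.
y_true = [1] * 10 + [0] * 990

# Predictor A never raises an event.
always_normal = [0] * 1000

# Predictor B finds 6 of the 10 events and raises 4 false alarms.
detector = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 986

for name, y_pred in [("always normal", always_normal), ("detector", detector)]:
    print(name, "accuracy:", accuracy(y_true, y_pred),
          "F1:", round(f_score(y_true, y_pred), 3))
```

In this toy setting both predictors reach an accuracy of about 0.99, yet the predictor that never flags an event has an F1 score of 0, which is the kind of difference the F-score is meant to expose.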
