Dimensionality Reduction for Water Quality Prediction from a Data Mining Perspective

Biochemical oxygen demand (BOD) is a measure of the amount of dissolved oxygen consumed by aerobic microbes as they oxidize organic matter in a water body, and it is widely used to assess water quality. The conventional method of determining BOD is cumbersome, so an automatic prediction model that is accurate, faster, and less expensive is needed. This paper presents a data-driven model for predicting BOD in a lower-dimensional space obtained with dimensionality reduction techniques, which help remove irrelevant properties of high-dimensional data. Five machine learning algorithms, namely decision stump, support vector machine (SVM), multilayer perceptron (MLP), linear regression (LR), and an instance-based learner (IBK), were first trained on the full dataset of 11 parameters. The training set was then transformed into a lower-dimensional space using principal component analysis (PCA) and correlation-based feature selection (CFS). The performance of the learners on the full training set and on the transformed datasets was compared using the correlation coefficient, root mean squared error (RMSE), and mean absolute error (MAE). The algorithms preserved their predictive accuracy in the lower-dimensional space.
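
The three evaluation measures named above (correlation coefficient, RMSE, and MAE) can be sketched in a few lines of Python. This is only an illustration of how such a comparison might be computed, not the authors' implementation: the function names are hypothetical, and the BOD values and predictions are made-up placeholders rather than data or results from this study.

```python
import math


def correlation_coefficient(actual, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def mae(actual, predicted):
    """Mean absolute error: average magnitude of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical placeholder values (NOT from the paper): observed BOD and the
# predictions of one learner trained on the full and on the reduced feature set.
actual_bod = [2.0, 4.0, 6.0, 8.0]
predicted_full = [2.5, 3.5, 6.5, 7.5]
predicted_reduced = [3.0, 4.0, 7.0, 8.0]

for label, predicted in [("full", predicted_full), ("reduced", predicted_reduced)]:
    print(
        label,
        round(correlation_coefficient(actual_bod, predicted), 4),
        round(rmse(actual_bod, predicted), 4),
        round(mae(actual_bod, predicted), 4),
    )
```

A learner can be said to preserve its accuracy when these three values change little between the full and the reduced feature sets. Reporting all three is useful because the correlation coefficient is insensitive to a constant offset in the predictions, whereas RMSE and MAE are not.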