BOD5 prediction using machine learning methods

Biological oxygen demand (BOD5) is an indicator used to monitor water quality. However, the standard procedure for measuring BOD5 is time-consuming and can delay crucial mitigation work in the event of pollution. To address this problem, this study employed several machine learning (ML) methods, namely random forest (RF), support vector regression (SVR) and multilayer perceptron (MLP), to train a model that accurately predicts the BOD5 value of a water sample from its other physical and chemical properties. The training parameters were optimized using a genetic algorithm (GA), and features were selected using the sequential feature selection (SFS) method. The proposed machine learning framework was first tested on a public dataset (Waterbase). The MLP method produced the best model, with an R2 score of 0.767 and a relative mean squared error (MSE) and relative mean absolute error (MAE) of approximately 15%. Feature importance calculations indicated that chemical oxygen demand (CODCr), ammonium and nitrate are the features that correlate most strongly with BOD5. In a field study on a small private dataset of water samples collected from two lakes in Jiangsu Province, China, the trained model showed a similar range of prediction error (around 15%) and a similar relative MAE (around 14%), and achieved a relative RMSE that was about 6% better.
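The workflow described above (sequential feature selection followed by a comparison of RF, SVR and MLP regressors) can be illustrated with a minimal scikit-learn sketch. It is not the study's implementation: the data are synthetic stand-ins for the water-quality features, the hyperparameters are fixed by hand rather than tuned with a GA, forward selection with a linear estimator is an assumed SFS configuration, and the relative errors are assumed to be the MAE and RMSE divided by the mean observed value, which the abstract does not define.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): six candidate features, of which
# only the first three actually drive the target.
rng = np.random.default_rng(0)
n_samples = 400
X = rng.uniform(0.0, 10.0, size=(n_samples, 6))
y = (5.0 + 0.6 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2]
     + rng.normal(0.0, 0.3, size=n_samples))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Forward sequential feature selection, scored by cross-validated R2.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
selector.fit(X_train, y_train)
selected = selector.get_support()  # boolean mask over the six columns
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Hand-picked hyperparameters (assumption); the study tuned these with a GA.
models = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "SVR": make_pipeline(StandardScaler(), SVR(C=10.0)),
    "MLP": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), solver="lbfgs",
                     max_iter=2000, random_state=0),
    ),
}


def relative_errors(y_true, y_pred):
    """Return MAE and RMSE divided by the mean observed value (assumed definition)."""
    mean_true = np.mean(y_true)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return mae / mean_true, rmse / mean_true


results = {}
for name, model in models.items():
    model.fit(X_train_sel, y_train)
    y_pred = model.predict(X_test_sel)
    rel_mae, rel_rmse = relative_errors(y_test, y_pred)
    results[name] = {
        "r2": r2_score(y_test, y_pred),
        "rel_mae": rel_mae,
        "rel_rmse": rel_rmse,
    }

print("Selected feature columns:", np.flatnonzero(selected))
for name, scores in results.items():
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
```

On real data, the synthetic arrays would be replaced by measured water-quality parameters and BOD5 values, and the fixed hyperparameters by the output of a search procedure such as the GA mentioned in the abstract.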
