A Transfer Learning-Based LSTM Strategy for Imputing Large-Scale Consecutive Missing Data and Its Application in a Water Quality Prediction System

Abstract In recent years, water quality monitoring has become crucial to improving water resource protection and management. Under the relevant laws and regulations, environmental protection agencies monitor lakes, streams, rivers, and other water bodies to assess water quality conditions. The valid, high-quality data generated by these monitoring activities help water resource managers understand existing pollution conditions, energy consumption problems, and pollution control needs. However, real-world water quality data inevitably suffer from many problems caused by human error or system failures, and missing data is one of the most frequent. Although most existing studies have explored classic statistical methods or emerging machine learning and deep learning methods to fill gaps in data, these methods are not suitable for large-scale consecutive missing data. To address this issue, this paper proposes a novel algorithm called TrAdaBoost-LSTM, which integrates state-of-the-art deep learning, through long short-term memory (LSTM), with instance-based transfer learning, through TrAdaBoost. The model inherits the advantages of both components: the LSTM's powerful ability to capture long-term dependencies in time series, and the flexibility of transfer learning to leverage related knowledge from complete datasets when filling large-scale consecutive gaps. A case study involving dissolved oxygen concentrations obtained from water quality monitoring stations is conducted to validate the effectiveness and superiority of the proposed method. The results show that the proposed TrAdaBoost-LSTM model improves imputation accuracy by 15% to 25% compared with alternative models on the obtained performance indicators, and also provides potential ideas for similar future research.
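The instance-based transfer step mentioned in the abstract can be illustrated with a minimal sketch of one TrAdaBoost-style reweighting round. This is not the paper's implementation: the function name `tradaboost_update`, its parameters, and the error normalisation (borrowed from the regression variants of TrAdaBoost) are assumptions made for illustration, and the base learner (an LSTM in the paper) is represented only by the absolute errors it would produce. Source instances (from complete datasets) that the learner fits poorly are down-weighted, while poorly fitted target instances (from the dataset with gaps) are up-weighted.

```python
import math


def tradaboost_update(weights, errors, n_source, n_rounds):
    """One TrAdaBoost-style reweighting round (illustrative sketch).

    weights  -- current instance weights, source instances first, then target
    errors   -- absolute prediction errors of the base learner, same order
    n_source -- number of source instances (assumed >= 1)
    n_rounds -- total number of boosting rounds planned
    """
    # Scale errors to [0, 1] so they can be used as exponents (assumption).
    max_err = max(errors)
    scaled = [e / max_err if max_err > 0 else 0.0 for e in errors]

    # Weighted error rate measured on the target instances only.
    tgt_w = weights[n_source:]
    tgt_e = scaled[n_source:]
    eps = sum(w * e for w, e in zip(tgt_w, tgt_e)) / sum(tgt_w)
    eps = min(max(eps, 1e-10), 0.499)  # keep beta_tgt strictly inside (0, 1)

    # Fixed shrink factor for source data, adaptive factor for target data.
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_source) / n_rounds))
    beta_tgt = eps / (1.0 - eps)

    updated = []
    for i, (w, e) in enumerate(zip(weights, scaled)):
        if i < n_source:
            updated.append(w * beta_src ** e)      # large error: weight shrinks
        else:
            updated.append(w * beta_tgt ** (-e))   # large error: weight grows

    total = sum(updated)
    return [w / total for w in updated]


# Two source and two target instances with equal starting weights.
new_weights = tradaboost_update(
    weights=[0.25, 0.25, 0.25, 0.25],
    errors=[2.0, 0.0, 1.0, 0.0],
    n_source=2,
    n_rounds=10,
)
```

In a full loop, the base learner would be refitted on the reweighted instances at every round and the final predictor would combine the later rounds; those steps are omitted here.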
