Sequential Imputation of Missing Spatio-Temporal Precipitation Data Using Random Forests

Meteorological records, including precipitation, commonly have missing values. Accurate imputation of missing precipitation values is challenging, however, because precipitation exhibits a high degree of spatial and temporal variability. Data-driven spatial interpolation of meteorological records is an increasingly popular approach in which missing values at a target station are imputed using synchronous data from reference stations. The success of spatial interpolation depends on whether precipitation records at the target station are strongly correlated with precipitation records at reference stations. However, because reference stations must have complete records, stations with incomplete records are excluded even when they are strongly correlated with the target station. To address this limitation, we develop a new sequential imputation algorithm for imputing missing values in spatio-temporal daily precipitation records. We demonstrate the benefits of sequential imputation by incorporating it into a Random Forest-based spatial interpolation method. Results show that for reliable imputation, having a few strongly correlated references is more effective than having a larger number of weakly correlated references. Further, we observe that sequential imputation becomes more beneficial as the number of stations with incomplete records increases. Overall, we present a new approach for imputing missing precipitation data, which may also apply to other meteorological variables.
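The sequential idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sequential_impute`, the `n_refs` parameter, the rule of imputing stations with the fewest gaps first, and the ranking of references by absolute Pearson correlation are all assumptions made for illustration. The only library calls used are NumPy routines and scikit-learn's `RandomForestRegressor`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def sequential_impute(data, n_refs=3, random_state=0):
    """Fill NaNs in a (days x stations) array one station at a time.

    Hypothetical sketch. Assumed ordering: stations with the fewest gaps
    are imputed first, and each completed station then becomes available
    as a reference for the stations that follow.
    Assumes every station has at least a few observed days.
    """
    filled = np.array(data, dtype=float)
    n_missing = np.isnan(filled).sum(axis=0)
    complete = [j for j in range(filled.shape[1]) if n_missing[j] == 0]
    if not complete:
        raise ValueError("at least one complete station is needed to start")

    order = [j for j in np.argsort(n_missing, kind="stable") if n_missing[j] > 0]
    for target in order:
        gaps = np.isnan(filled[:, target])
        observed = ~gaps

        # Rank currently complete stations by |correlation| with the target,
        # computed only on days when the target was observed.
        corr = np.nan_to_num([
            abs(np.corrcoef(filled[observed, r], filled[observed, target])[0, 1])
            for r in complete
        ])
        refs = [complete[i] for i in np.argsort(corr)[::-1][:n_refs]]

        # Train on observed days, then predict the target's missing days.
        model = RandomForestRegressor(n_estimators=100, random_state=random_state)
        model.fit(filled[observed][:, refs], filled[observed, target])
        filled[gaps, target] = model.predict(filled[gaps][:, refs])

        # The target is now complete and can serve as a reference.
        complete.append(target)
    return filled


# Synthetic example: four stations sharing one regional signal.
rng = np.random.default_rng(0)
base = rng.gamma(shape=0.5, scale=4.0, size=300)
stations = np.clip(base[:, None] + rng.normal(0, 0.3, (300, 4)), 0, None)
stations[rng.random(300) < 0.2, 1] = np.nan  # station 1: about 20% missing
stations[rng.random(300) < 0.4, 2] = np.nan  # station 2: about 40% missing
result = sequential_impute(stations, n_refs=2)
```

In this sketch, station 1 is imputed from the two complete stations, after which it is eligible as a reference for station 2. A non-sequential scheme would restrict station 2 to the originally complete stations only.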
