Comparison of eight filter-based feature selection methods for monthly streamflow forecasting – Three case studies on CAMELS data sets

Abstract: Data-driven models are increasingly used to forecast streamflow. However, the performance of filter-based feature selection (FFS) methods in data-driven monthly streamflow forecasting has not been studied in detail. In this study, we investigated the effectiveness of eight common FFS methods, namely linear Pearson correlation, partial linear Pearson correlation (PCI), mutual information (MI), conditional MI, partial MI, maximal relevance minimal redundancy Pearson correlation, maximal relevance minimal redundancy MI, and the gamma test, coupled with three regression models, namely multiple linear regression (MLR), ensemble extreme learning machine (enELM), and k-nearest neighbor (KNN) regression, for real-world one-month-ahead streamflow forecasting. The study was conducted on three cases from the Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) data sets. In addition, two termination criterion (TC) methods, the Hampel test and resampling, were compared. The results highlight three findings. First, no FFS method was dominant when coupled with enELM or KNN. Second, when resampling was used to select a final model from the candidate combinations of the eight FFS methods and three regression models, PCI was the most favorable FFS method for the final model. Finally, the Hampel test TC was superior to the resampling TC in terms of stability and resistance to overfitting. These findings offer practical reference value for real-world monthly streamflow forecasting.
