A method for comparing data splitting approaches for developing hydrological ANN models

Data splitting is an important step in the artificial neural network (ANN) development process, whereby data are divided into training, test and validation subsets to ensure good generalization ability of the model. Because only one split of the data is typically used when developing an ANN model, data splitting can have a significant impact on the performance of the final model by introducing bias and variance into the model development process. It is therefore important to find a robust data splitting method that results in an ANN model representing the underlying data generation process of a given dataset. In practice, ANN models developed using different data splitting methods are often assessed based on validation results. Previous research has found, however, that validation results alone are not adequate for assessing the performance of ANN models. Data splitting methods can bias the validation results by allocating extreme observations to the training set, leaving the test and validation sets with fewer patterns than the training set. Consequently, the generalization ability of the model may be compromised and the trained model cannot be adequately validated. This paper introduces a method for fairly comparing different data splitting methods for developing ANN models. The methodology is applied to compare a number of well-known data splitting techniques in the context of hydrological ANN modeling problems.
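To illustrate why relying on a single split can matter, the following minimal sketch randomly divides a synthetic, skewed series into training, test and validation subsets under several random seeds and reports how the subset maxima and means vary. It is not the comparison method introduced in the paper. The 60/20/20 proportions, the log-normal "flow" values and the helper names `split_indices` and `subset_summary` are assumptions made for illustration only.

```python
import random
import statistics


def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly partition range(n) into training, test and validation indices.

    The fractions are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = round(fractions[0] * n)
    n_test = round(fractions[1] * n)
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    valid = idx[n_train + n_test:]  # remainder goes to validation
    return train, test, valid


def subset_summary(values, indices):
    """Return size, mean and maximum of the selected observations."""
    sub = [values[i] for i in indices]
    return {"n": len(sub), "mean": statistics.mean(sub), "max": max(sub)}


# Synthetic, positively skewed series standing in for a hydrological variable.
data_rng = random.Random(42)
flows = [data_rng.lognormvariate(0.0, 1.0) for _ in range(200)]

# Different seeds give different splits; the location of the extreme
# observations (and hence the subset statistics) changes from split to split.
for seed in range(3):
    train, test, valid = split_indices(len(flows), seed=seed)
    for name, part in (("train", train), ("test", test), ("valid", valid)):
        s = subset_summary(flows, part)
        print(f"seed={seed} {name:5s} n={s['n']:3d} "
              f"mean={s['mean']:.3f} max={s['max']:.3f}")
```

Running the loop shows that the largest observations fall into different subsets depending on the seed, which is the kind of split-dependent variation in validation statistics that the abstract describes.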
