Model Data Selection and Data Pre-processing Approaches

Data-based modeling relies on historical data and does not directly account for the underlying physical processes in hydrology. As a result, real-world modeling of hydrological processes commonly requires a complex input structure and very long training records to represent the inherently complex dynamic system. When a large amount of input data is available and all of it is used for modeling, technical problems such as increased computational complexity and insufficient memory have been observed. These problems are more likely in hydrological modeling, because the models are highly nonlinear and have a large number of parameters. There is therefore a clear need for techniques that adequately reduce both the number of inputs and the training data length required by nonlinear models. The main purposes of these approaches are to remove redundant inputs from the pool of available inputs and to decide on the optimum data length for a reliable prediction. This chapter describes the abilities of novel techniques such as the Gamma Test (GT), entropy theory (ET), Principal Component Analysis (PCA), cluster analysis (CA), Akaike's Information Criterion (AIC), and the Bayesian Information Criterion (BIC) in model data selection. The novelty of this work is that many of these approaches are used for the first time in hydrological modeling scenarios such as solar radiation estimation, rainfall-runoff modeling, and evapotranspiration modeling. Towards the end of the chapter, conventional data selection procedures such as the Cross-Correlation Approach (CCA), the Cross-Validation Approach (CVA), and the Data Splitting Approach (DSA) are explained in detail. These traditional approaches are used to validate the newly applied methods in the later case study chapters.
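As a rough illustration of how an information criterion can rank candidate input sets, the sketch below scores three hypothetical models with AIC and BIC. It assumes the common least-squares form with Gaussian errors, AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), where n is the number of training samples, k the number of fitted parameters, and RSS the residual sum of squares. The function names, the candidate labels, and the RSS values are invented for the example and are not taken from the chapter.

```python
import math


def aic_least_squares(rss: float, n: int, k: int) -> float:
    """AIC for a least-squares fit with Gaussian errors (additive constant dropped)."""
    return n * math.log(rss / n) + 2 * k


def bic_least_squares(rss: float, n: int, k: int) -> float:
    """BIC for the same setting; the penalty per parameter grows with ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)


def rank_candidates(candidates, n, criterion):
    """Return candidates sorted from best (lowest score) to worst."""
    return sorted(candidates, key=lambda c: criterion(c["rss"], n, c["k"]))


# Hypothetical candidate input sets: k counts fitted parameters,
# rss is a made-up residual sum of squares on the training data.
n_samples = 100
candidates = [
    {"label": "1 input", "k": 2, "rss": 48.0},
    {"label": "2 inputs", "k": 3, "rss": 30.0},
    {"label": "3 inputs", "k": 4, "rss": 29.5},
]

best_by_aic = rank_candidates(candidates, n_samples, aic_least_squares)[0]["label"]
best_by_bic = rank_candidates(candidates, n_samples, bic_least_squares)[0]["label"]

print("Preferred by AIC:", best_by_aic)
print("Preferred by BIC:", best_by_bic)
```

With these made-up numbers, the third input lowers the residual error only slightly, so the penalty term outweighs the gain and both criteria prefer the two-input model. This is the sense in which such criteria help discard redundant inputs.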
