A data-driven feature selection and machine-learning model benchmark for the prediction of longitudinal dispersion coefficient

Longitudinal dispersion (LD) is the dominant process of scalar transport in natural streams. Accurate prediction of the LD coefficient (Dl) can substantially improve related simulations. Emerging machine learning (ML) techniques provide a self-adaptive tool for this problem. However, most existing studies rely on an unverified four-parameter feature set obtained through simple theoretical deduction, and few have examined its reliability and rationality. Moreover, because systematic model comparisons are lacking, the proper choice of ML model for different scenarios remains unknown. In this study, the Feature Gradient selector was first adopted to distill locally optimal feature sets directly from multivariable data. A globally optimal feature set (channel width, flow velocity, channel slope, and cross-sectional area) was then proposed by numerically comparing the performance of the distilled local optima with representative ML models. The channel slope is identified as the key parameter for predicting Dl. Furthermore, we designed a weighted evaluation metric that enables comprehensive model comparison. With a simple linear model as the baseline, a benchmark of single and ensemble learning models is provided, and the advantages and disadvantages of the methods involved are discussed. Results show that the support vector machine performs significantly better than the other models. The decision tree is not suitable for this problem because of its poor generalization ability. Notably, simple models outperform complicated models on this low-dimensional problem because they strike a better balance between regression and generalization.
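
The benchmark idea described above (a linear baseline compared against a support vector machine and a decision tree on the four selected features) can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, the target relation and its coefficients are arbitrary assumptions, the hyper-parameters are untuned, and plain cross-validated R² stands in for the weighted evaluation metric, whose definition is not given here. The sketch assumes scikit-learn and NumPy are available.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_samples = 200

# Synthetic stand-ins for the four features (assumed ranges, not real data):
# channel width, flow velocity, channel slope, cross-sectional area.
features = np.column_stack([
    rng.uniform(10.0, 200.0, n_samples),   # width
    rng.uniform(0.1, 2.0, n_samples),      # velocity
    rng.uniform(1e-4, 1e-2, n_samples),    # slope
    rng.uniform(5.0, 500.0, n_samples),    # area
])
X = np.log(features)

# Arbitrary, made-up log-linear target plus noise, for illustration only.
y = X @ np.array([1.0, 1.0, 0.5, 0.2]) + rng.normal(0.0, 0.3, n_samples)

# Candidate models; settings are illustrative defaults, not tuned values.
models = {
    "linear_baseline": LinearRegression(),
    "svr": make_pipeline(StandardScaler(), SVR(C=10.0, epsilon=0.1)),
    "decision_tree": DecisionTreeRegressor(random_state=0),
}

# Same folds for every model so the comparison is like-for-like.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: float(cross_val_score(model, X, y, cv=cv, scoring="r2").mean())
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean cross-validated R^2 = {score:.3f}")
```

Because the synthetic target is log-linear by construction, the ranking this sketch produces says nothing about the ranking on measured river data; it only shows how a shared cross-validation split keeps the comparison between models fair.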
