A comparison of statistical learning methods for deriving determining factors of accident occurrence from an imbalanced high-resolution dataset.

One of the main aims of accident data analysis is to derive the determining factors associated with road traffic accident occurrence. While current studies mainly use variants of count data regression for this purpose, the problem can also be treated as a binary classification task in which a dichotomous target variable distinguishes events (accidents) from non-events (no accidents). The effects of 45 variables describing road condition and geometry, traffic volume and regulations, weather, and accident time are analyzed using a dataset with high temporal (1 h) and spatial (250 m) resolution that covers the whole highway network of Austria over four consecutive years. A combination of synthetic minority oversampling and maximum dissimilarity undersampling is used to balance the training dataset. We apply a series of statistical learning techniques, compare their predictive performance, and discuss the importance of the determining factors of accident occurrence across the ensemble of models. The findings confirm that a trade-off between accuracy and sensitivity is inherent to imbalanced classification problems. Tree-based methods perform satisfactorily, with accuracies between 75% and 90% and sensitivities between 30% and 50%. Overall, this analysis emphasizes the merits of using high-resolution data in accident analysis.
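The trade-off between accuracy and sensitivity can be illustrated with a minimal sketch. The data below are hypothetical toy values chosen only for illustration (1,000 observations with a 1% event rate); they are not taken from the study, and the function name `accuracy_and_sensitivity` is an assumption introduced here.

```python
def accuracy_and_sensitivity(y_true, y_pred):
    """Return (accuracy, sensitivity) for binary labels, where 1 marks an event."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(y_true)
    accuracy = (tp + tn) / len(y_true)
    # Sensitivity is the share of actual events that were detected.
    sensitivity = tp / positives if positives else float("nan")
    return accuracy, sensitivity


# Hypothetical toy data: 10 events followed by 990 non-events.
y_true = [1] * 10 + [0] * 990

# Classifier A never predicts an event.
pred_never = [0] * 1000

# Classifier B detects 4 of the 10 events but raises 150 false alarms.
pred_flagging = [1] * 4 + [0] * 6 + [1] * 150 + [0] * 840

acc_a, sens_a = accuracy_and_sensitivity(y_true, pred_never)
acc_b, sens_b = accuracy_and_sensitivity(y_true, pred_flagging)

print(f"never flags: accuracy={acc_a:.3f}, sensitivity={sens_a:.3f}")
print(f"flags some:  accuracy={acc_b:.3f}, sensitivity={sens_b:.3f}")
```

In this toy setting, the classifier that never predicts an accident reaches the higher accuracy but detects no events at all, whereas the second classifier gives up accuracy in exchange for detecting some of the events.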
