Handling Imbalanced Data in Road Crash Severity Prediction by Machine Learning Algorithms

Crash severity is a fundamental aspect of a crash event. Although machine learning algorithms for predicting crash severity have recently attracted interest from the academic community, many studies neglect the fact that crash datasets are acutely imbalanced. Overlooking this generally leads to classifiers that perform poorly on the minority class (crashes with higher severity). In this paper, the random undersampling of the majority class (RUMC) technique is used to handle imbalanced accident datasets and provide better predictions for the minority class. Using an imbalanced training set and a RUMC-based balanced training set, we calibrate, validate, and evaluate four crash severity predictive models: random tree, k-nearest neighbor, logistic regression, and random forest. Accuracy, true positive rate (recall), false positive rate, true negative rate, precision, F1-score, and the confusion matrix are calculated to assess performance. The results show that RUMC-based models make the classifiers more reliable at detecting fatal crashes and those causing injury. In the imbalanced models, the true positive rate for predicting fatal crashes and those causing injury ranges from 0% (logistic regression) to 18.3% (k-nearest neighbor), while for the RUMC-based models it ranges from 52.5% (RUMC-based logistic regression) to 57.2% (RUMC-based k-nearest neighbor). Organizations and decision-makers could use RUMC and machine learning algorithms to predict the severity of a crash occurrence, manage present operations, and plan future work.
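As an illustration of the idea only, and not the authors' implementation, the following Python sketch shows one way random undersampling of the majority class and the listed performance measures could be computed. It assumes a binary label and a 1:1 class ratio after resampling; the function names `random_undersample` and `binary_metrics` and the toy labels `"severe"` and `"pdo"` are hypothetical and do not come from the paper.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly discard rows until every class is as small as the rarest class.

    Assumption: a 1:1 class ratio is the target; the paper may use another ratio.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Minority rows are all kept; majority rows are sampled without replacement.
        out_rows.extend(rng.sample(members, n_min))
        out_labels.extend([label] * n_min)

    # Shuffle so that the classes are not stored in contiguous blocks.
    order = list(range(len(out_rows)))
    rng.shuffle(order)
    return [out_rows[i] for i in order], [out_labels[i] for i in order]


def binary_metrics(y_true, y_pred, positive):
    """Derive the measures named in the abstract from confusion-matrix counts."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))

    def ratio(num, den):
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)      # true positive rate
    precision = ratio(tp, tp + fp)
    return {
        "confusion": {"tp": tp, "fn": fn, "fp": fp, "tn": tn},
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": recall,
        "fpr": ratio(fp, fp + tn),   # false positive rate
        "tnr": ratio(tn, fp + tn),   # true negative rate
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Toy data: 10 severe crashes against 90 property-damage-only crashes.
    rows = list(range(100))
    labels = ["severe"] * 10 + ["pdo"] * 90
    _, balanced_labels = random_undersample(rows, labels, seed=1)
    print(Counter(balanced_labels))

    # A classifier that always predicts the majority class looks accurate
    # but never detects a severe crash.
    always_pdo = ["pdo"] * len(labels)
    print(binary_metrics(labels, always_pdo, positive="severe"))
```

In a setup like this, only the training set would be resampled, and the evaluation data would keep its original class distribution so that the reported rates reflect realistic conditions. The always-majority predictor in the usage example shows why accuracy alone is misleading on imbalanced data: it reaches high accuracy while its true positive rate is zero.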
