Classification of motor vehicle crash injury severity: A hybrid approach for imbalanced data.

This study aims to classify injury severity in motor-vehicle crashes with both high accuracy and high sensitivity. The dataset contains 297,113 vehicle crashes from 2016-2017, obtained from Michigan Traffic Crash Facts (MTCF). As in other crash datasets, the injury severity classes in MTCF are not equally represented. To account for this class imbalance, we apply several resampling techniques, including under-sampling and over-sampling. Using five classification models (logistic regression, decision tree, neural network, gradient boosting, and naïve Bayes), we classify the levels of injury severity and attempt to improve classification performance with two ensemble methods: bootstrap aggregation (bagging) and majority voting. Because of the imbalance in the dataset, we evaluate classification performance with the geometric mean (G-mean). We show that classification performance is highest when bagging is used with decision trees and over-sampling is applied to the imbalanced data. The effect of the imbalance treatments is largest when under-sampling is combined with bagging. In addition to the original five injury severity classes in the MTCF dataset, we consider two additional classification problems, one with two classes and the other with three classes, to (1) investigate the impact of the number of classes on the performance of classification models, and (2) enable comparison of our results with the literature.
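To illustrate why G-mean is preferred over accuracy on imbalanced classes, the following is a minimal sketch, not the study's implementation. It assumes the common multi-class definition of G-mean as the geometric mean of per-class recall (sensitivity); the abstract does not state the exact formula used. The function name `g_mean` and the toy labels are hypothetical.

```python
from collections import Counter
from math import prod


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition of G-mean)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("y_true must not be empty")
    # Number of true examples per class.
    support = Counter(y_true)
    # Number of correctly predicted examples per class.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Recall for every class that occurs in y_true (0.0 if never hit).
    recalls = [hits[c] / support[c] for c in support]
    return prod(recalls) ** (1.0 / len(recalls))


def accuracy(y_true, y_pred):
    """Fraction of examples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy imbalanced labels (hypothetical): 8 "none" crashes, 2 "injury" crashes.
y_true = ["none"] * 8 + ["injury"] * 2

# A classifier that always predicts the majority class.
always_none = ["none"] * 10

# A classifier that finds one of the two injuries but misses two "none" cases.
mixed = ["none"] * 6 + ["injury"] * 2 + ["injury", "none"]

print(accuracy(y_true, always_none), g_mean(y_true, always_none))
print(accuracy(y_true, mixed), g_mean(y_true, mixed))
```

In this toy case the majority-class predictor has the higher accuracy but a G-mean of zero, because its recall on the minority class is zero. The second predictor has lower accuracy but a nonzero G-mean, since a single zero-recall class drives the product to zero.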
