Text Mining the Contributors to Rail Accidents

Rail accidents represent an important safety concern for the transportation industry in many countries. In the 11 years from 2001 to 2012, the U.S. had more than 40 000 rail accidents that cost more than $45 million. While most of the accidents during this period incurred very little cost, about 5200 had damages in excess of $141 500. To better understand the contributors to these extreme accidents, the Federal Railroad Administration has required the railroads involved in accidents to submit reports that contain both fixed field entries and narratives describing the characteristics of the accident. While a number of studies have looked at the fixed fields, none has extensively analyzed the narratives. This paper describes the use of text mining with a combination of techniques to automatically discover accident characteristics that can inform a better understanding of the contributors to the accidents. The study evaluates the efficacy of text mining of accident narratives by assessing predictive performance for the costs of extreme accidents. The results show that predictive accuracy for accident costs improves significantly with features found by text mining, and improves further with modern ensemble methods. Importantly, the study also shows through case examples how findings from text mining of the narratives can improve understanding of the contributors to rail accidents in ways not possible through fixed field analysis of the accident reports alone.
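As a rough illustration of what combining fixed fields with narrative-derived features can look like, the following minimal sketch builds term-count features from free-text narratives and appends them to a fixed field. It is not the method used in the paper, which the abstract does not specify in detail. The records, the `speed_mph` field, the stopword list, and the helper names `tokenize`, `build_vocabulary`, and `featurize` are all hypothetical and invented for this example.

```python
import re
from collections import Counter

# Hypothetical toy records, invented for illustration (not FRA data).
reports = [
    {"speed_mph": 10, "narrative": "Train struck a vehicle at the crossing."},
    {"speed_mph": 45, "narrative": "Broken rail caused derailment of six cars."},
    {"speed_mph": 5, "narrative": "Yard switch misaligned; two cars derailed."},
]

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "at", "the", "of", "and"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_vocabulary(docs, min_doc_freq=1):
    """Return the sorted terms that occur in at least min_doc_freq documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return sorted(term for term, n in doc_freq.items() if n >= min_doc_freq)


def featurize(report, vocabulary):
    """Concatenate the fixed field with per-term counts from the narrative."""
    counts = Counter(tokenize(report["narrative"]))
    return [report["speed_mph"]] + [counts[term] for term in vocabulary]


# Keep only terms shared by at least two narratives.
vocabulary = build_vocabulary([r["narrative"] for r in reports], min_doc_freq=2)
features = [featurize(r, vocabulary) for r in reports]

print(vocabulary)
print(features)
```

Feature vectors of this kind could then be passed to a predictive model for accident cost, such as an ensemble method. The simple term counts here stand in for whatever text-mining features a real analysis would extract.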
