Construction site accident analysis using text mining and natural language processing techniques

Abstract Workplace safety is a major concern in many countries. Among industries, the construction sector is identified as the most hazardous workplace. Construction accidents not only cause human suffering but also result in huge financial losses. To prevent the recurrence of similar accidents and to make scientific risk control plans, analysis of accidents is essential. In the construction industry, fatality and catastrophe investigation summary reports are available for past accidents. In this study, text mining and natural language processing (NLP) techniques are applied to analyze construction accident reports. More specifically, five baseline models, namely support vector machine (SVM), linear regression (LR), K-nearest neighbor (KNN), decision tree (DT), and Naive Bayes (NB), together with an ensemble model, are proposed to classify the causes of the accidents. In addition, the Sequential Quadratic Programming (SQP) algorithm is used to optimize the weight of each classifier in the ensemble model. Experimental results show that the optimized ensemble model outperforms the other models considered in this study in terms of average weighted F1 score. The results also show that the proposed approach is more robust to cases of low support. Moreover, an unsupervised chunking approach is proposed to extract common objects that cause accidents, based on grammar rules identified in the reports. As harmful objects are one of the major factors leading to construction accidents, identifying such objects is extremely helpful for mitigating potential risks. Certain limitations of the proposed methods are discussed, and suggestions for future improvements are provided.
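
To make the ensemble idea and the evaluation metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes the classifiers are combined by weighted soft voting over class probabilities, which the abstract does not specify. The function names, the toy probabilities, and the hand-picked weights are all hypothetical; in the study the weights are tuned by SQP, which is not reproduced here.

```python
from collections import Counter


def ensemble_predict(probs_per_clf, weights):
    """Weighted soft voting (an assumed combining rule).

    probs_per_clf[c][i] is classifier c's class-probability row for sample i.
    Returns one predicted class index per sample.
    """
    n_samples = len(probs_per_clf[0])
    n_classes = len(probs_per_clf[0][0])
    predictions = []
    for i in range(n_samples):
        combined = [0.0] * n_classes
        for w, clf_probs in zip(weights, probs_per_clf):
            for k, p in enumerate(clf_probs[i]):
                combined[k] += w * p
        # Ties resolve to the lowest class index.
        predictions.append(max(range(n_classes), key=lambda k: combined[k]))
    return predictions


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical toy data: 3 classifiers, 4 reports, 2 cause classes.
y_true = [0, 1, 1, 0]
probs = [
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]],    # classifier A
    [[0.4, 0.6], [0.3, 0.7], [0.3, 0.7], [0.6, 0.4]],    # classifier B
    [[0.6, 0.4], [0.6, 0.4], [0.45, 0.55], [0.4, 0.6]],  # classifier C
]

equal_weights = [1 / 3, 1 / 3, 1 / 3]   # hand-picked, not SQP-optimized
only_a = [1.0, 0.0, 0.0]                # degenerate case: classifier A alone

ensemble_score = weighted_f1(y_true, ensemble_predict(probs, equal_weights))
single_score = weighted_f1(y_true, ensemble_predict(probs, only_a))
print(ensemble_score, single_score)
```

In this toy case each classifier errs on different reports, so the combined vote scores higher than classifier A alone. An optimizer such as SQP would search over the weight vector, typically constrained to be non-negative and to sum to one, to maximize a score of this kind on held-out data.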
