Automated content analysis for construction safety: A natural language processing system to extract precursors and outcomes from unstructured injury reports

Abstract In the United States, as in many other countries, construction workers are more likely to be injured on the job than workers in any other industry. This poor safety performance causes substantial human and financial losses and has motivated extensive research. Unfortunately, safety improvement in construction has decelerated in the last decade, and traditional safety programs have reached saturation. Yet major construction companies and federal agencies possess a wealth of empirical knowledge in the form of large databases of digital construction injury reports. This knowledge could be used to better understand, predict, and prevent construction accidents. However, for lack of a clear methodology and because of the high cost of large-scale manual content analysis, these valuable data have yet to be extracted and leveraged. Recently, researchers have proposed a framework that allows meaningful empirical data to be extracted from accident reports, but the resource limitations inherent to manual content analysis remain. The present study tested the proposition that manual content analysis of injury reports can be eliminated using natural language processing (NLP). This paper describes (1) the overall strategy and methodology used in developing the system, and specifically how key challenges with decoding unstructured reports were overcome; (2) how the system was built through an iterative process of coding and testing against manual content analysis results from a team of seven independent analysts; and (3) the implications and potential uses of the extracted data. The results indicate that the NLP system can quickly and automatically scan unstructured injury reports for 101 attributes and outcomes with over 95% accuracy.
The main contribution of this research is to enable any organization to quickly obtain a large and highly reliable structured data set of attributes and outcomes from its databases of unstructured accident reports. Such structured data are a prerequisite for applying statistical modeling techniques, which in turn allow new safety knowledge to be extracted and, ultimately, safety management to be improved.
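
As a rough illustration of the two ideas in the abstract (scanning free-text reports for attributes, then scoring the output against manual content analysis), the following Python sketch may help. It is not the system described in the paper: the keyword rules, attribute names, sample reports, and manual labels are all hypothetical, and the abstract does not state how the actual system detects attributes or how its accuracy figure was computed.

```python
import re

# Hypothetical keyword rules (assumption): attribute name -> regular expression.
RULES = {
    "ladder": re.compile(r"\bladders?\b", re.IGNORECASE),
    "hammer": re.compile(r"\bhammers?\b", re.IGNORECASE),
    "uneven surface": re.compile(
        r"\buneven\s+(surface|ground|floor)\b", re.IGNORECASE
    ),
}


def extract_attributes(report):
    """Return the set of attribute names whose rule matches the report text."""
    return {name for name, pattern in RULES.items() if pattern.search(report)}


def accuracy(predicted, manual, attributes):
    """Share of (report, attribute) present/absent decisions that agree."""
    total = len(predicted) * len(attributes)
    agree = sum(
        (attr in pred) == (attr in ref)
        for pred, ref in zip(predicted, manual)
        for attr in attributes
    )
    return agree / total


# Made-up reports and made-up manual labels, for illustration only.
reports = [
    "Worker slipped off a ladder while carrying a hammer.",
    "Employee tripped on uneven ground near the trench.",
]
manual = [{"ladder", "hammer"}, {"uneven surface"}]

predicted = [extract_attributes(r) for r in reports]
print(predicted)
print(accuracy(predicted, manual, list(RULES)))
```

The output is one set of detected attributes per report, which maps directly onto a binary report-by-attribute matrix of the kind statistical models expect as input.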
