Much of the information about work-related injuries and illnesses in the U.S. is recorded only as short text narratives on Occupational Safety and Health Administration (OSHA) logs and workers’ compensation records. Analysis of these data has the potential to answer many important questions about workplace safety, but typically requires that the individual cases first be “coded” to indicate their specific characteristics. Unfortunately, the process of assigning these codes is often manual, time-consuming, and prone to human error. This paper compares manual and automated approaches to assigning detailed occupation, nature of injury, part of body, event resulting in injury, and source of injury codes to narratives collected through the Survey of Occupational Injuries and Illnesses, an annual survey of U.S. establishments that collects OSHA logs describing approximately 300,000 work-related injuries and illnesses each year. We review previous efforts to automate similar coding tasks and demonstrate that machine learning coders based on the logistic regression and support vector machine algorithms outperform those based on naive Bayes, and achieve coding accuracies comparable to or better than those of trained human coders.
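To make the idea of automated "coding" concrete, the following is a minimal sketch of a multinomial naive Bayes coder, the baseline family that the logistic regression and support vector machine coders are compared against. It is not the implementation evaluated in the paper: the narratives, the three labels (`FALL`, `CUT`, `STRAIN`), and the function names are hypothetical, and real coding schemes contain far more categories.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy narratives and labels, invented for illustration only.
TRAIN = [
    ("worker slipped on wet floor and fell", "FALL"),
    ("employee fell from ladder while painting", "FALL"),
    ("cut finger with knife while slicing meat", "CUT"),
    ("lacerated hand on sheet metal edge", "CUT"),
    ("strained back lifting heavy box", "STRAIN"),
    ("pulled shoulder muscle lifting patient", "STRAIN"),
]


def tokenize(text):
    """Lowercase the narrative and split it on whitespace."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count documents per label and word occurrences per label."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior from the share of training narratives with this label.
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
print(predict(model, "fell off ladder"))
```

The discriminative coders named above replace these per-word count ratios with weights fitted directly to separate the codes; the input (a bag of words from the narrative) and the output (a single code) keep the same shape.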