Bayesian methods: a useful tool for classifying injury narratives into cause groups

To compare two Bayesian methods (Fuzzy and Naïve) for classifying injury narratives in large administrative databases into event cause groups, a dataset of 14 000 narratives was randomly extracted from claims filed with a workers’ compensation insurance provider. Two expert coders assigned one-digit and two-digit Bureau of Labor Statistics (BLS) Occupational Injury and Illness Classification event codes to each narrative. The narratives were separated into a training set of 11 000 cases and a prediction set of 3000 cases. The training set was used to develop two Bayesian classifiers that assigned BLS codes to narratives. Each model was then evaluated on the prediction set. Both models performed well and tended to predict one-digit BLS codes more accurately than two-digit codes. For one-digit and two-digit codes, respectively, the overall sensitivity of the Fuzzy method was 78% and 64%, specificity was 93% and 95%, and positive predictive value (PPV) was 78% and 65%. The Naïve method showed similar accuracy: sensitivity of 80% and 70%, specificity of 96% and 97%, and PPV of 80% and 70%. For large administrative databases, Bayesian methods show considerable promise as a means of classifying injury narratives into cause groups. Overall, Naïve Bayes provided slightly more accurate predictions than Fuzzy Bayes.
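The difference between the two classifiers can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy narratives, the labels "fall" and "struck", the Laplace smoothing, and all function names are assumptions made for illustration. The Naïve rule multiplies the evidence from every word in a narrative, whereas the Fuzzy rule, in one common formulation, scores each category by its single strongest word. A small helper also shows how sensitivity, specificity, and PPV can be computed for one category against all others.

```python
import math
from collections import Counter, defaultdict

def train(narratives, labels):
    """Count, per word, how many narratives of each class contain it."""
    class_counts = Counter(labels)
    word_class = defaultdict(Counter)
    for text, label in zip(narratives, labels):
        for word in set(text.lower().split()):
            word_class[word][label] += 1
    return class_counts, dict(word_class)

def p_word_given_class(word, c, class_counts, word_class):
    # Laplace-smoothed P(word present | class); the smoothing is an assumption.
    return (word_class.get(word, Counter())[c] + 1) / (class_counts[c] + 2)

def naive_bayes(text, class_counts, word_class):
    """Naive rule: prior times the product of all word likelihoods (in logs)."""
    total = sum(class_counts.values())
    words = set(text.lower().split())
    scores = {}
    for c, count in class_counts.items():
        score = math.log(count / total)
        for w in words:
            score += math.log(p_word_given_class(w, c, class_counts, word_class))
        scores[c] = score
    return max(scores, key=scores.get)

def fuzzy_bayes(text, class_counts, word_class):
    """Fuzzy rule (one formulation): score a class by its best single word,
    max over words of P(word | class) * P(class) / P(word)."""
    total = sum(class_counts.values())
    priors = {c: n / total for c, n in class_counts.items()}
    words = set(text.lower().split())
    scores = {}
    for c in class_counts:
        best = 0.0
        for w in words:
            p_w = sum(p_word_given_class(w, k, class_counts, word_class) * priors[k]
                      for k in class_counts)
            posterior = p_word_given_class(w, c, class_counts, word_class) * priors[c] / p_w
            best = max(best, posterior)
        scores[c] = best
    return max(scores, key=scores.get)

def one_vs_rest_metrics(true_labels, predicted, category):
    """Sensitivity, specificity, and PPV for one category against all others."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(t == category and p == category for t, p in pairs)
    fn = sum(t == category and p != category for t, p in pairs)
    fp = sum(t != category and p == category for t, p in pairs)
    tn = sum(t != category and p != category for t, p in pairs)
    ratio = lambda a, b: a / b if b else float("nan")
    return ratio(tp, tp + fn), ratio(tn, tn + fp), ratio(tp, tp + fp)

# Hypothetical toy training data, not taken from the study.
NARRATIVES = [
    "slipped on wet floor and fell",
    "fell from ladder while painting",
    "tripped on stairs and fell",
    "struck by falling box",
    "hit by forklift in warehouse",
    "struck by door on hand",
]
LABELS = ["fall", "fall", "fall", "struck", "struck", "struck"]

if __name__ == "__main__":
    counts, table = train(NARRATIVES, LABELS)
    for text in ["employee fell from ladder", "worker struck by forklift"]:
        print(text, "|", naive_bayes(text, counts, table), "|", fuzzy_bayes(text, counts, table))
```

On such a small example both rules agree. They can diverge when a narrative contains one highly predictive word alongside many weakly predictive ones, because the Fuzzy score ignores all but the strongest word.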
