Paper information - Semi-Supervised Cause Identification from Aviation Safety Reports

We introduce cause identification, a new problem involving classification of incident reports in the aviation domain. Specifically, given a set of pre-defined causes, a cause identification system seeks to identify all and only those causes that can explain why the aviation incident described in a given report occurred. The difficulty of cause identification stems in part from the fact that it is a multi-class, multilabel categorization task, and in part from the skewness of the class distributions and the scarcity of annotated reports. To improve the performance of a cause identification system for the minority classes, we present a bootstrapping algorithm that automatically augments a training set by learning from a small amount of labeled data and a large amount of unlabeled data. Experimental results show that our algorithm yields a relative error reduction of 6.3% in F-measure for the minority classes in comparison to a baseline that learns solely from the labeled data.
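The abstract reports its gain as a relative error reduction in F-measure but does not spell out the formula. The sketch below assumes one common reading, in which the error is taken as 1 minus the F-measure and the reduction is expressed as a fraction of the baseline error. The function name and the example scores are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(f_baseline: float, f_system: float) -> float:
    """Relative error reduction, ASSUMING error is defined as 1 - F-measure.

    Both arguments are F-measure scores in [0, 1]; the baseline must be
    below 1 so that there is some error left to reduce.
    """
    if not 0.0 <= f_baseline < 1.0:
        raise ValueError("f_baseline must be in [0, 1)")
    if not 0.0 <= f_system <= 1.0:
        raise ValueError("f_system must be in [0, 1]")
    err_baseline = 1.0 - f_baseline  # error of the labeled-data-only baseline
    err_system = 1.0 - f_system      # error of the improved system
    return (err_baseline - err_system) / err_baseline


if __name__ == "__main__":
    # Hypothetical scores for illustration only, not results from the paper.
    reduction = relative_error_reduction(f_baseline=0.50, f_system=0.55)
    print(f"relative error reduction: {reduction:.1%}")
```

Under this reading, a relative reduction of a few percent corresponds to a smaller absolute change in F-measure, with the exact size depending on the baseline score.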

Vincent Ng | Isaac Persing
