Most complex aerospace systems involve large numbers of text reports relating to safety, maintenance, and associated issues. Some have thousands of reports, spanning decades. The Space Shuttle has over 100,000 reports from just the last two decades. Similarly, the Aviation Safety Reporting System (ASRS) database spans several decades and contains over 600,000 reports. These repositories contain valuable information about system health, particularly about trends and recurring problems. However, repository volume and complexity can make human analysis difficult. Current methods for identifying recurring anomalies rely on a divide-and-conquer strategy. A system is decomposed into subsystems, and experts on those subsystems read and monitor current reports. Thus, problems and anomalies can be tracked and trended. This methodology requires experts who can recall and integrate reports spanning potentially long time scales, since current reports are often related to ones from months or years past. Recall is aided by sorting reports into specific anomaly categories. Tracking and trending are aided by monitoring category totals over time. Category utility, however, relies on the experts to categorize reports correctly and consistently. With multiple experts, inter-rater reliability must somehow be assured. Clearly, human reading, comprehension, and association with relevant prior reports are essential to system health assurance. The weak point is the identification and recall of relevant prior reports. A decision support system that automatically analyzes reports and provides consistent discovery, characterization, and categorization would be extremely useful. This paper discusses recent innovations in the field of text mining that enable the automatic discovery of anomalies in such text repositories, using statistical and content-based clustering techniques, and the characterization and categorization of these anomalies using advanced classification algorithms.
The first innovation discovers recurring anomalies using content-based and statistical clustering techniques. The system, known as the Recurring Anomaly Detection System (ReADS), performs comprehensive analysis of an entire information repository to discover recurring anomalies and presents the results in an intuitive interactive visualization for further investigation by experts. The second innovation, known as Mariana, automatically classifies documents into predetermined categories using an advanced classifier, known as a Support Vector Machine, along with a Markov Chain Monte Carlo simulation to find the best hyperparameters for the model. Another approach to finding predetermined categories, based on Non-negative Matrix Factorization, is also showing promising results. We discuss the application of these techniques to the problems of discovering recurring anomalies and categorizing anomalies in the space and aeronautics domains.
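As a rough illustration of content-based grouping of reports, the following minimal sketch represents each report as a unit-length bag-of-words vector and groups reports whose cosine similarity to a group's first report exceeds a threshold. It is not the ReADS algorithm; the function names, the greedy single-pass grouping, the threshold of 0.5, and the sample reports are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def unit_vector(text):
    """Bag-of-words term counts scaled to unit Euclidean length."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def group_reports(reports, threshold=0.5):
    """Greedy single-pass grouping (hypothetical stand-in for a real clusterer).

    Each report joins the first group whose seed report is similar enough;
    otherwise it starts a new group.
    """
    vectors = [unit_vector(r) for r in reports]
    groups = []  # lists of report indices; the first index in each is the seed
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Invented sample reports, for illustration only.
reports = [
    "hydraulic pump leak found during preflight inspection",
    "cabin pressure sensor gave intermittent reading",
    "leak found at hydraulic pump seal during inspection",
    "intermittent reading from cabin pressure sensor",
]
groups = group_reports(reports)
print(groups)
```

A group containing more than one report is a candidate recurring anomaly for an expert to review. A practical system would add term weighting, stop-word removal, and a clustering method that does not depend on report order.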