Natural language processing for aviation safety reports: From classification to interactive analysis

Highlights
• We identify the needs for NLP in the analysis of aviation safety reports.
• Automatic document classification can be performed.
• Probabilistic topic modelling is of limited use for identifying new aspects.
• We present an interactive search tool that relies on document similarity.
• We present another interactive tool that helps experts identify non-technical aspects.

In this paper we describe the different NLP techniques designed and used in collaboration between the CLLE-ERSS research laboratory and the CFH/Safety Data company to manage and analyse aviation incident reports. These reports are written every time anything abnormal occurs during a civil air flight. Although most of them relate routine problems, they are a valuable source of information about possible sources of greater danger. These texts are written in plain language, show a wide range of linguistic variation (from a telegraphic style crowded with acronyms to standard prose) and exist in different languages, even for a single company or country (although our main focus is on English and French). In addition to their variety, their sheer quantity (e.g. 600 per month for a large airline company) clearly requires the use of advanced NLP and text mining techniques in order to extract useful information from them. Although this context and these objectives seem to indicate that standard NLP techniques can be applied in a straightforward manner, innovative techniques are required to handle the specifics of aviation report text and the complex classification systems. We present several tools that aim at better access to this data (classification and information retrieval) and that help aviation safety experts in their analyses (data/text mining and interactive analysis). Some of these tools are currently in test or in use at both the national and international levels, by airline companies as well as by regulation authorities: DGAC (Direction Générale de l'Aviation Civile), EASA (European Aviation Safety Agency) and ICAO (International Civil Aviation Organization).
