The Airbus Air Traffic Control speech recognition 2018 challenge: towards ATC automatic transcription and call sign detection

In this paper, we describe the outcomes of the challenge organized and run by Airbus and partners in 2018 on Air Traffic Control (ATC) speech recognition. The challenge consisted of two tasks applied to English ATC speech: 1) automatic speech-to-text transcription and 2) call sign detection (CSD). The registered participants were provided with 40 hours of speech along with manual transcriptions. Twenty-two teams submitted predictions on a five-hour evaluation set. ATC speech processing is challenging for several reasons: high speech rate, foreign-accented speech with a great diversity of accents, and noisy communication channels. The best-ranked team achieved a 7.62% Word Error Rate and an 82.41% CSD F1-score. Transcribing pilots' speech was found to be twice as hard as transcribing controllers' speech. Remaining issues towards solving ATC ASR are also discussed in the paper.
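For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of their textbook definitions. It is not the challenge's official scoring script: the text normalization applied before scoring and the way call sign matches were counted are not described here, and the function names `word_error_rate` and `f1_score` are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    any normalization (case, digits, spelling) is left to the caller.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall from detection counts."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

As a usage illustration with made-up data, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion over four reference words, and `f1_score(8, 2, 2)` combines a precision and a recall of 0.8 each.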
