Progress in the BBN keyword search system for the DARPA RATS program

This paper presents a set of techniques that we used to improve our keyword search system for the third phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded radio communication channels. We report results for both Levantine and Farsi, the two target languages for the keyword search (KWS) task. Using acoustic features derived from stacked Multi-Layer Perceptrons (MLP) together with Deep Neural Network (DNN) acoustic models yields an absolute reduction in word error rate of about 13% (from 70.2% to 57.6%). In addition to score normalization and score/system combination for keyword search, we show that reducing the deletion errors of the speech-to-text system lowers the false alarm rate at the target false reject rate (15%) by about 1% absolute (from 5.39% to 4.45%).

Index Terms: speech recognition, KWS, MLP, DNN
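The reductions quoted above are absolute differences in percentage points. As a minimal sketch of the arithmetic, the snippet below recomputes them from the reported figures and also derives the corresponding relative reductions. The helper name `reduction` is hypothetical, and the relative figures are not stated in the abstract; they follow only from the four reported numbers.

```python
def reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute, relative) reduction; inputs and outputs are in percent."""
    absolute = before - after              # difference in percentage points
    relative = 100.0 * absolute / before   # reduction as a share of the starting value
    return absolute, relative


# Figures taken from the abstract.
wer_abs, wer_rel = reduction(70.2, 57.6)   # word error rate
fa_abs, fa_rel = reduction(5.39, 4.45)     # false alarm rate at 15% false reject

print(f"WER: {wer_abs:.1f} points absolute, {wer_rel:.1f}% relative")
print(f"False alarm: {fa_abs:.2f} points absolute, {fa_rel:.1f}% relative")
```

Under these figures the word error rate drops by 12.6 points (the "about 13%" above) and the false alarm rate by 0.94 points (the "about 1%").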
