Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks

We explore techniques to improve the robustness of small-footprint keyword spotting models based on deep neural networks (DNNs) in the presence of background noise and in far-field conditions. We find that system performance can be improved significantly, with relative improvements of up to 75% in far-field conditions, by combining multi-style training with a novel formulation of automatic gain control (AGC) that estimates the levels of both speech and background noise. Further, we find that these techniques allow us to achieve competitive performance even when applied to DNNs with an order of magnitude fewer parameters than our baseline.
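As a rough illustration of the multi-style training idea mentioned above, the sketch below mixes a noise recording into a clean utterance at a randomly drawn signal-to-noise ratio. This is a generic data-augmentation recipe, not the authors' pipeline: the function names (`rms`, `mix_at_snr`, `multi_style_example`) and the SNR range are assumptions chosen for illustration, and the abstract does not specify how the training data were actually constructed.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise RMS ratio is snr_db.

    Assumes speech and noise have the same length (hypothetical setup).
    """
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        # Silent noise: nothing to mix in.
        return list(speech)
    # Noise level that yields the requested SNR relative to the speech level.
    target_noise_rms = rms(speech) / (10.0 ** (snr_db / 20.0))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


def multi_style_example(speech, noise, rng, snr_range_db=(-5.0, 10.0)):
    """Create one noisy training example at a random SNR.

    snr_range_db is an assumed range, not a value taken from the paper.
    """
    snr_db = rng.uniform(*snr_range_db)
    return mix_at_snr(speech, noise, snr_db), snr_db


if __name__ == "__main__":
    rng = random.Random(0)
    speech = [math.sin(0.1 * i) for i in range(1000)]    # stand-in for a clean utterance
    noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # stand-in for background noise
    noisy, snr_db = multi_style_example(speech, noise, rng)
    print(f"mixed {len(noisy)} samples at {snr_db:.1f} dB SNR")
```

Training on a mix of clean and noise-corrupted copies of each utterance is what lets the model see a range of acoustic conditions; the AGC component described in the abstract is a separate step and is not sketched here, since its formulation is not given in this text.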
