Paper information - Query-by-example keyword spotting using long short-term memory networks

We present a novel approach to query-by-example keyword spotting (KWS) using a long short-term memory (LSTM) recurrent neural network-based feature extractor. In our approach, we represent each keyword using a fixed-length feature vector obtained by running the keyword audio through a word-based LSTM acoustic model. We use the activations prior to the softmax layer of the LSTM as our keyword-vector. At runtime, we detect the keyword by extracting the same feature vector from a sliding window and computing a simple similarity score between this test vector and the keyword vector. With clean speech, we achieve 86% relative false rejection rate reduction at 0.5% false alarm rate when compared to a competitive phoneme posteriorgram with dynamic time warping KWS system, while the reduction in the presence of babble noise is 67%. Our system has a small memory footprint, low computational cost, and high precision, making it suitable for on-device applications.
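The runtime step described in the abstract (embed a sliding window, then score it against the stored keyword vector) can be illustrated with a minimal sketch. Several pieces here are assumptions and not taken from the paper's text above: the abstract only says "a simple similarity score", so cosine similarity is used as one plausible choice; `mean_pool_embed` is a hypothetical placeholder that averages frames and merely stands in for the LSTM feature extractor; the names `detect_keyword`, `window`, `hop`, and `threshold` are illustrative.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Frames = Sequence[Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def mean_pool_embed(frames: Frames) -> Vector:
    """Hypothetical stand-in for the LSTM extractor: average the frames of a window."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


def detect_keyword(
    frames: Frames,
    keyword_vector: Sequence[float],
    embed: Callable[[Frames], Vector],
    window: int,
    hop: int = 1,
    threshold: float = 0.9,
) -> List[Tuple[int, float]]:
    """Return (start_frame, score) for each window whose score reaches the threshold."""
    hits = []
    for start in range(0, len(frames) - window + 1, hop):
        test_vector = embed(frames[start:start + window])
        score = cosine_similarity(test_vector, keyword_vector)
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy enrollment: one example of the "keyword" as two 2-dimensional frames.
keyword_vector = mean_pool_embed([[1.0, 0.0], [1.0, 0.0]])

# Toy test stream in which the keyword pattern occupies frames 2 and 3.
stream = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

hits = detect_keyword(stream, keyword_vector, mean_pool_embed, window=2)
print(hits)
```

In this toy stream only the window starting at frame 2 reaches the threshold. In a real system the embedder would be the trained network, and the threshold would be tuned to trade false alarms against false rejections.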

Tara N. Sainath | Carolina Parada | Guoguo Chen
