A hybrid HMM/DNN approach to keyword spotting of short words

An HMM/DNN framework is proposed to address the issues of short-word detection. The first-stage keyword hypothesizer is redesigned with a context-aware keyword model and a 9-state filler model, reducing the miss rate from 80% to 6% and increasing the figure-of-merit (FOM) from 6.08% to 21.88% for short words. The hypothesizer is followed by an MLP-based second-stage keyword verifier that further reduces its putative hits. To enhance short-word detection, three new techniques are incorporated into the redesigned verifier: an HMM-based feature transformation for the MLPs, knowledge-based features, and deep neural networks. With a set of nine short keywords from the TIMIT set, the best FOM achieved by the proposed KWS system was 42.79%, which is comparable to the 42.6% for long content words and much better than the FOM of 18.4% for short keywords reported in previous research [10].
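For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how a miss rate and an FOM could be computed for a single keyword. It assumes the commonly used definition of FOM as the detection rate averaged over 1 to 10 false alarms per keyword per hour, with interpolation at the upper end; the abstract does not state which variant was used, so this may differ from the paper's exact scoring. The function names, the `(score, is_true_hit)` input format, and the example values are all hypothetical.

```python
import math


def miss_rate(hits, num_true):
    """Fraction of actual keyword occurrences with no matching putative hit."""
    detected = sum(1 for _, is_true in hits if is_true)
    return 1.0 - detected / num_true


def figure_of_merit(hits, num_true, hours):
    """Average detection rate over 1..10 false alarms per keyword per hour.

    hits     -- list of (score, is_true_hit) putative hits for one keyword
    num_true -- number of actual occurrences of the keyword in the test set
    hours    -- duration of the test speech in hours (must be > 0)
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    ten_t = 10.0 * hours
    n = math.ceil(ten_t - 0.5)  # first integer >= 10T - 0.5
    a = ten_t - n               # interpolation weight for the (n+1)-th point

    # p[i] is the detection rate reached just before the (i+1)-th false alarm.
    p = []
    found = 0
    for _, is_true in ranked:
        if is_true:
            found += 1
        else:
            p.append(found / num_true)
            if len(p) == n + 1:
                break
    # Fewer than n+1 false alarms: the detection rate stays at its final value.
    while len(p) < n + 1:
        p.append(found / num_true)

    return (sum(p[:n]) + a * p[n]) / ten_t


# Hypothetical example: 4 true occurrences, 1 hour of speech, 4 putative hits.
example_hits = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
example_fom = figure_of_merit(example_hits, num_true=4, hours=1.0)
example_miss = miss_rate(example_hits, num_true=4)
```

A system-level FOM would typically be obtained by averaging the per-keyword values over all keywords, which is again an assumption about the evaluation rather than something stated in the abstract.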

[1] Hiromitsu Nishizaki, et al. Spoken Term Detection Using Multiple Recognition Systems, 2009.

[2] Bhuvana Ramabhadran, et al. Vocabulary independent spoken term detection, 2007, SIGIR.

[3] Jiazhi Ou, et al. Utterance verification of short keywords using hybrid neural-network/HMM approach, 2001, 2001 International Conferences on Info-Tech and Info-Net. Proceedings (Cat. No.01EX479).

[4] Ellen M. Voorhees, et al. The TREC Spoken Document Retrieval Track: A Success Story, 2000, TREC.

[5] Sabato Marco Siniscalchi, et al. Combining speech attribute detection and penalized logistic regression for phoneme recognition, 2012, Neurocomputing.

[6] Geoffrey E. Hinton, et al. Acoustic Modeling Using Deep Belief Networks, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[7] John Harris, et al. English Sound Structure, 1994.

[8] I-Fan Chen, et al. A Study on Using Word-Level HMMs to Improve ASR Performance over State-of-the-Art Phone-Level Acoustic Modeling for LVCSR, 2012, INTERSPEECH.

[9] Chin-Hui Lee, et al. Automatic recognition of keywords in unconstrained speech using hidden Markov models, 1990, IEEE Trans. Acoust. Speech Signal Process.

[10] Chin-Hui Lee, et al. Speech recognition using weighted HMM and subspace projection approaches, 1994, IEEE Trans. Speech Audio Process.

[11] J. Weijer, et al. Word length, sentence length and frequency: Zipf revisited, 2004.

[12] I-Fan Chen, et al. Articulatory feature asynchrony analysis and compensation in detection-based ASR, 2009, INTERSPEECH.

[13] Lukás Burget, et al. Comparison of keyword spotting approaches for informal continuous speech, 2005, INTERSPEECH.

[14] Mitchel Weintraub, et al. LVCSR log-likelihood ratio scoring for keyword spotting, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[15] Sadaoki Furui, et al. Speaker-independent isolated word recognition using dynamic features of speech spectrum, 1986, IEEE Trans. Acoust. Speech Signal Process.

[16] Richard Lippmann, et al. Hybrid neural-network/HMM approaches to wordspotting, 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[17] Chin-Hui Lee, et al. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, 1994, IEEE Trans. Speech Audio Process.

[18] Stan Davis, et al. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se, 1980.

[19] Andreas Stolcke, et al. The SRI/OGI 2006 spoken term detection system, 2007, INTERSPEECH.

[20] Daniel P. W. Ellis, et al. Tandem connectionist feature extraction for conventional HMM systems, 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[21] Chin-Hui Lee, et al. A study on word detector design and knowledge-based pruning and rescoring, 2007, INTERSPEECH.

[22] S. Furui, et al. Automatic recognition and understanding of spoken language - a first step toward natural human-machine communication, 2000, Proceedings of the IEEE.

[23] Richard Rose, et al. A hidden Markov model based keyword recognition system, 1990, International Conference on Acoustics, Speech, and Signal Processing.

[24] Simon King, et al. Detection of phonological features in continuous speech using neural networks, 2000, Comput. Speech Lang.

[25] Herbert Gish, et al. Reducing word error rate on conversational speech from the Switchboard corpus, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[26] Yee Whye Teh, et al. A Fast Learning Algorithm for Deep Belief Nets, 2006, Neural Computation.

[27] Herbert Gish, et al. Rapid and accurate spoken term detection, 2007, INTERSPEECH.