Paper information - A hybrid HMM/DNN approach to keyword spotting of short words

An HMM/DNN framework is proposed to address the difficulties of short-word detection. The first-stage keyword hypothesizer is redesigned with a context-aware keyword model and a 9-state filler model, reducing the miss rate from 80% to 6% and increasing the figure-of-merit (FOM) from 6.08% to 21.88% for short words. The hypothesizer is followed by an MLP-based second-stage keyword verifier that further reduces its putative hits. To enhance short-word detection, three new techniques are incorporated into the redesigned verifier: an HMM-based feature transformation for the MLPs, knowledge-based features, and deep neural networks. With a set of nine short keywords from the TIMIT set, the best FOM achieved by the proposed KWS system was 42.79%, which is comparable to the FOM of 42.6% for long content words and much better than the FOM of 18.4% for short keywords reported in previous research [10].
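For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how a miss rate and a figure-of-merit could be computed from a ranked list of putative hits. It assumes one common simplified reading of FOM, namely the detection rate averaged over operating points of 1 to 10 false alarms per keyword per hour; the abstract does not state the exact definition used, and the function names, parameters, and data below are hypothetical.

```python
def detection_rate(hits, n_true, allowed_false_alarms):
    """Fraction of true occurrences detected before the false-alarm budget is exceeded.

    hits: list of (score, is_correct) putative hits for one keyword (hypothetical format).
    n_true: number of true occurrences of the keyword in the test data.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    detected = 0
    false_alarms = 0
    for _score, is_correct in ranked:
        if is_correct:
            detected += 1
        else:
            false_alarms += 1
            if false_alarms > allowed_false_alarms:
                break
    return detected / n_true


def figure_of_merit(hits, n_true, hours, max_fa_per_hour=10):
    """Simplified FOM (assumption): mean detection rate at 1..max_fa_per_hour
    false alarms per keyword per hour, expressed as a percentage."""
    rates = [
        detection_rate(hits, n_true, fa_rate * hours)
        for fa_rate in range(1, max_fa_per_hour + 1)
    ]
    return 100.0 * sum(rates) / len(rates)


def miss_rate(hits, n_true):
    """Percentage of true occurrences that never appear among the putative hits."""
    detected = sum(1 for _score, is_correct in hits if is_correct)
    return 100.0 * (1 - detected / n_true)


# Toy data: five putative hits for a keyword with four true occurrences in one hour of speech.
example_hits = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
example_fom = figure_of_merit(example_hits, n_true=4, hours=1)
example_miss = miss_rate(example_hits, n_true=4)
```

Under this reading, a hypothesizer with a high miss rate caps the achievable FOM, because occurrences it never proposes cannot be recovered by a later verification stage.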

I-Fan Chen | Chin-Hui Lee

[1] Hiromitsu Nishizaki, et al. Spoken Term Detection Using Multiple Recognition Systems, 2009.

[2] Bhuvana Ramabhadran, et al. Vocabulary independent spoken term detection, 2007, SIGIR.

[3] Jiazhi Ou, et al. Utterance verification of short keywords using hybrid neural-network/HMM approach, 2001, 2001 International Conferences on Info-Tech and Info-Net. Proceedings (Cat. No.01EX479).

[4] Ellen M. Voorhees, et al. The TREC Spoken Document Retrieval Track: A Success Story, 2000, TREC.

[5] Sabato Marco Siniscalchi, et al. Combining speech attribute detection and penalized logistic regression for phoneme recognition, 2012, Neurocomputing.

[6] Geoffrey E. Hinton, et al. Acoustic Modeling Using Deep Belief Networks, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[7] John Harris, et al. English Sound Structure, 1994.

[8] I-Fan Chen, et al. A Study on Using Word-Level HMMs to Improve ASR Performance over State-of-the-Art Phone-Level Acoustic Modeling for LVCSR, 2012, INTERSPEECH.

[9] Chin-Hui Lee, et al. Automatic recognition of keywords in unconstrained speech using hidden Markov models, 1990, IEEE Trans. Acoust. Speech Signal Process.

[10] Chin-Hui Lee, et al. Speech recognition using weighted HMM and subspace projection approaches, 1994, IEEE Trans. Speech Audio Process.

[11] J. Weijer, et al. Word length, sentence length and frequency: Zipf revisited, 2004.

[12] I-Fan Chen, et al. Articulatory feature asynchrony analysis and compensation in detection-based ASR, 2009, INTERSPEECH.

[13] Lukás Burget, et al. Comparison of keyword spotting approaches for informal continuous speech, 2005, INTERSPEECH.

[14] Mitchel Weintraub, et al. LVCSR log-likelihood ratio scoring for keyword spotting, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[15] Sadaoki Furui, et al. Speaker-independent isolated word recognition using dynamic features of speech spectrum, 1986, IEEE Trans. Acoust. Speech Signal Process.

[16] Richard Lippmann, et al. Hybrid neural-network/HMM approaches to wordspotting, 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[17] Chin-Hui Lee, et al. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, 1994, IEEE Trans. Speech Audio Process.

[18] Stan Davis, et al. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se, 1980.

[19] Andreas Stolcke, et al. The SRI/OGI 2006 spoken term detection system, 2007, INTERSPEECH.

[20] Daniel P. W. Ellis, et al. Tandem connectionist feature extraction for conventional HMM systems, 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[21] Chin-Hui Lee, et al. A study on word detector design and knowledge-based pruning and rescoring, 2007, INTERSPEECH.

[22] S. Furui, et al. Automatic recognition and understanding of spoken language - a first step toward natural human-machine communication, 2000, Proceedings of the IEEE.

[23] Richard Rose, et al. A hidden Markov model based keyword recognition system, 1990, International Conference on Acoustics, Speech, and Signal Processing.

[24] Simon King, et al. Detection of phonological features in continuous speech using neural networks, 2000, Comput. Speech Lang.

[25] Herbert Gish, et al. Reducing word error rate on conversational speech from the Switchboard corpus, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[26] Yee Whye Teh, et al. A Fast Learning Algorithm for Deep Belief Nets, 2006, Neural Computation.

[27] Herbert Gish, et al. Rapid and accurate spoken term detection, 2007, INTERSPEECH.