A resource-dependent approach to word modeling for keyword spotting

A hierarchical framework is proposed to address the problem of modeling different types of words in keyword spotting (KWS). Keyword models are built at different levels according to the amount of training data available for each individual word. The proposed approach improves KWS performance even when no training speech is available for the keywords. It also suggests an easier way to collect training data for these resource-limited words. Experimental results show that the proposed framework improves KWS performance, measured by figure of merit (FOM), regardless of the number of training instances for each keyword. For words with abundant speech data, the proposed method exploits the training data better than the conventional modeling technique and boosts the system FOM from 9.79% to 42.78%. For words with a small amount of training data, the new method increases the system FOM from 29.05% to 49.06%. Even for keywords without any training examples, the new modeling scheme improves the system FOM from 60.96% to 66.51%.
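The resource-dependent idea, choosing a modeling level per keyword from the amount of training speech available, can be illustrated with a minimal sketch. The abstract does not name the levels or any thresholds, so the three level names, the `choose_model_level` function, the `abundant_threshold` value, and the example counts below are all hypothetical and are not taken from the paper.

```python
from enum import Enum


class ModelLevel(Enum):
    # Hypothetical level names; the abstract only says "various levels".
    WORD = "word"        # dedicated model for a keyword with abundant data
    ADAPTED = "adapted"  # generic model adapted with a few keyword examples
    SUBWORD = "subword"  # generic sub-word model; no keyword speech at all


def choose_model_level(num_examples, abundant_threshold=100):
    """Pick a modeling level from the number of training instances.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    if num_examples < 0:
        raise ValueError("num_examples must be non-negative")
    if num_examples == 0:
        return ModelLevel.SUBWORD
    if num_examples < abundant_threshold:
        return ModelLevel.ADAPTED
    return ModelLevel.WORD


# Hypothetical keyword counts, for illustration only.
counts = {"hello": 500, "budget": 12, "zeitgeist": 0}
levels = {word: choose_model_level(n) for word, n in counts.items()}
```

The three branches mirror the three cases reported above: keywords with abundant data, keywords with a small amount of data, and keywords with no training examples.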
