A keyword-boosted sMBR criterion to enhance keyword search performance in deep neural network based acoustic modeling

We propose a keyword-boosted state-level minimum Bayes risk (sMBR) criterion for training DNN-HMM hybrid keyword search systems that enhances the acoustic modeling detail of a given list of target keywords. The rationale behind this discriminative training strategy is to place more acoustic modeling emphasis on the states that appear in the given keywords. On the IARPA Babel program's Vietnamese limited-language-pack task, keyword-boosted sMBR training yielded a relative gain of 1.7% to 6.1% in actual term weighted value (ATWV) over conventional sMBR systems. A detailed analysis of the results suggests that the proposed sMBR objective function improves ATWV by raising the probability that keywords are detected in the system output, with increased correct and insertion rates in the decoded lattices.
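For readers unfamiliar with the metric, the following is a minimal sketch of how a term weighted value is commonly computed under the NIST Spoken Term Detection definition, evaluated at the system's actual decision threshold. It is not taken from this paper: the `KeywordCounts` type, the function name `atwv`, and the per-keyword counts are hypothetical, and the weight `beta = 999.9` and the approximation of non-target trials as seconds of speech minus true occurrences are assumed from the NIST definition.

```python
from typing import Dict, NamedTuple

BETA = 999.9  # assumed: the weight used in the NIST STD definition


class KeywordCounts(NamedTuple):
    """Hypothetical per-keyword tallies at the system's decision threshold."""
    n_true: int         # reference occurrences of the keyword
    n_correct: int      # detections that match a reference occurrence
    n_false_alarm: int  # detections with no matching reference occurrence


def atwv(counts: Dict[str, KeywordCounts], speech_seconds: float,
         beta: float = BETA) -> float:
    """Return 1 minus the average of P_miss + beta * P_FA over keywords."""
    penalties = []
    for c in counts.values():
        if c.n_true == 0:
            continue  # keywords absent from the reference are skipped
        p_miss = 1.0 - c.n_correct / c.n_true
        # Non-target trials are approximated as one per second of speech.
        p_fa = c.n_false_alarm / (speech_seconds - c.n_true)
        penalties.append(p_miss + beta * p_fa)
    if not penalties:
        raise ValueError("no keyword has a reference occurrence")
    return 1.0 - sum(penalties) / len(penalties)


# Made-up counts for two keywords over 10 hours of speech.
example = {
    "kw_a": KeywordCounts(n_true=10, n_correct=8, n_false_alarm=1),
    "kw_b": KeywordCounts(n_true=5, n_correct=5, n_false_alarm=0),
}
score = atwv(example, speech_seconds=36000.0)
```

Under this definition each missed occurrence of a keyword costs 1/n_true, while each false alarm costs beta divided by the number of non-target trials, so a system that detects more true keyword occurrences can score higher even if its insertions also increase, provided the false-alarm term grows more slowly than the miss term shrinks.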
