A Novel Discriminative Score Calibration Method for Keyword Search

The performance of keyword search systems depends heavily on the quality of confidence scores. In this work, a novel discriminative score calibration method has been proposed. By training an MLP classifier employing the word posterior probability and several novel normalized scores, we can obtain a relative improvement of 4.67% for the actual term-weighted value (ATWV) metric on the OpenKWS15 development test dataset. In addition, a LSTM-CTC based keyword verification method has been proposed to supply extra acoustic information. After the information is added, a further improvement of 7.05% over the baseline can be observed.

[1]  Yifan Gong,et al.  Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[2]  Kai Feng,et al.  The subspace Gaussian mixture model - A structured model for speech recognition , 2011, Comput. Speech Lang..

[3]  Mari Ostendorf,et al.  Compensating for Word Posterior Estimation Bias in Confusion Networks , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[4]  Andrew Rosenberg,et al.  Using word burst analysis to rescore keyword search candidates on low-resource languages , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Mark J. F. Gales,et al.  Investigation of multilingual deep neural networks for spoken term detection , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[6]  Andreas Stolcke,et al.  Finding consensus in speech recognition: word error minimization and other applications of confusion networks , 2000, Comput. Speech Lang..

[7]  Xiaodong Cui,et al.  System combination and score normalization for spoken term detection , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8]  Yu Zhang,et al.  Graph-based re-ranking using acoustic feature similarity between search results for spoken term detection on low-resource languages , 2014, INTERSPEECH.

[9]  Julia Hirschberg,et al.  Rescoring Confusion Networks for Keyword Search , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Meng Cai,et al.  Calibration of word posterior estimation in confusion networks for keyword search , 2015, 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA).

[11]  Navdeep Jaitly,et al.  Vocal Tract Length Perturbation (VTLP) improves speech recognition , 2013 .

[12]  Lin-Shan Lee,et al.  Improved spoken term detection with graph-based re-ranking in feature space , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[14]  Dong Wang,et al.  Augmented set of features for confidence estimation in spoken term detection , 2010, INTERSPEECH.

[15]  Haizhou Li,et al.  Discriminative score normalization for keyword search decision , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Richard M. Schwartz,et al.  Score normalization and system combination for improved keyword spotting , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[17]  Richard M. Schwartz,et al.  White Listing and Score Normalization for Keyword Spotting of Noisy Speech , 2012, INTERSPEECH.

[18]  Lin-Shan Lee,et al.  Enhanced Spoken Term Detection Using Support Vector Machines and Weighted Pseudo Examples , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[19]  Meng Cai,et al.  High-performance Swahili keyword search with very limited language pack: The THUEE system for the OpenKWS15 evaluation , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[20]  Meng Cai,et al.  Convolutional maxout neural networks for low-resource speech recognition , 2014, The 9th International Symposium on Chinese Spoken Language Processing.

[21]  Herbert Gish,et al.  Rapid and accurate spoken term detection , 2007, INTERSPEECH.

[22]  Gunnar Evermann,et al.  Large vocabulary decoding and confidence estimation using word posterior probabilities , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[23]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[24]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[25]  Tetsuya Takiguchi,et al.  Two-step correction of speech recognition errors based on n-gram and long contextual information , 2013, INTERSPEECH.