Combination of FST and CN search in spoken term detection

Spoken Term Detection (STD) focuses on finding instances of a particular spoken word or phrase in an audio corpus. Most STD systems have a two-step pipeline, ASR followed by search. Two approaches to search are common, Confusion Network (CN) based search and Finite State Transducer (FST) based search. In this paper, we examine combination of these two different search approaches, using the same ASR output. We find that the CN search performs better on shorter queries, and FST search performs better on longer queries. By combining the different search results from the same ASR decoding, we achieve better performance compared to either search approach on its own. We also find that this improvement is additive to the usual combination of decoder results using different modeling techniques.

[1]  Xiaohui Zhang,et al.  Improving deep neural network acoustic models using generalized maxout networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[3]  Florian Metze,et al.  Improving language-universal feature extraction with deep maxout and convolutional neural networks , 2014, INTERSPEECH.

[4]  Murat Saraclar,et al.  Lattice Indexing for Spoken Term Detection , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[5]  Fuchun Peng,et al.  Search results based N-best hypothesis rescoring with maximum entropy classification , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[6]  Florian Metze,et al.  Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training , 2013, INTERSPEECH.

[7]  David Yarowsky,et al.  Quantifying the value of pronunciation lexicons for keyword search in lowresource languages , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8]  Herbert Gish,et al.  Rapid and accurate spoken term detection , 2007, INTERSPEECH.

[9]  Andreas Stolcke,et al.  The SRI/OGI 2006 spoken term detection system , 2007, INTERSPEECH.

[10]  Richard M. Schwartz,et al.  White Listing and Score Normalization for Keyword Spotting of Noisy Speech , 2012, INTERSPEECH.

[11]  Owen Kimball,et al.  Detection of unseen words in conversational Mandarin , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Richard M. Schwartz,et al.  Semi-Supervised Methods for Improving Keyword Search of Unseen Terms , 2012, INTERSPEECH.

[13]  Andreas Stolcke,et al.  Finding consensus in speech recognition: word error minimization and other applications of confusion networks , 2000, Comput. Speech Lang..

[14]  Bhuvana Ramabhadran,et al.  Vocabulary independent spoken term detection , 2007, SIGIR.

[15]  Xiaodong Cui,et al.  System combination and score normalization for spoken term detection , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[16]  Florian Metze,et al.  Deep maxout networks for low-resource speech recognition , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[17]  Julia Hirschberg,et al.  Rescoring Confusion Networks for Keyword Search , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  Florian Metze,et al.  Distributed learning of multilingual DNN feature extractors using GPUs , 2014, INTERSPEECH.

[19]  Jan Cernocký,et al.  Improved feature processing for deep neural networks , 2013, INTERSPEECH.

[20]  Florian Metze,et al.  Towards speaker adaptive training of deep neural network acoustic models , 2014, INTERSPEECH.

[21]  Ngoc Thang Vu,et al.  Generating exact lattices in the WFST framework , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22]  Jonathan G. Fiscus,et al.  Results of the 2006 Spoken Term Detection Evaluation , 2006 .

[23]  Sanjeev Khudanpur,et al.  Using proxies for OOV keywords in the keyword search task , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[24]  Alexander I. Rudnicky,et al.  Using conversational word bursts in spoken term detection , 2013, INTERSPEECH.

[25]  Haihua Xu,et al.  Minimum Bayes Risk decoding and system combination based on a recursion for edit distance , 2011, Comput. Speech Lang..

[26]  Owen Kimball,et al.  Subword speech recognition for detection of unseen words , 2012, INTERSPEECH.

[27]  Richard M. Schwartz,et al.  Score normalization and system combination for improved keyword spotting , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[28]  Brian Kingsbury,et al.  Exploiting diversity for spoken term detection , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[29]  Andreas Stolcke,et al.  Finding consensus among words: lattice-based word error minimization , 1999, EUROSPEECH.

[30]  Sanjeev Khudanpur,et al.  Can You Repeat That? Using Word Repetition to Improve Spoken Term Detection , 2014, ACL.