Improvements on transducing syllable lattice to word lattice for keyword search

This paper investigates a weighted finite-state transducer (WFST)-based syllable decoding and transduction method for keyword search (KWS) and compares it in detail with sub-word search and phone confusion methods. Context-dependent acoustic phone models are trained from word-level forced alignments and then used for syllable decoding and lattice generation. Pronunciations for out-of-vocabulary (OOV) keywords are produced by a grapheme-to-syllable (G2S) system and used to construct a lexical transducer. The lexical transducer is then composed with a keyword-boosted language model (LM) to transduce the syllable lattices into word lattices for the final KWS. Word error rates (WER) and KWS results are reported for five languages. The syllable transduction method is shown to give KWS results comparable to those of the syllable search and phone confusion methods. Combining the three methods further improves OOV KWS performance.
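As a rough illustration of the transduction step, the following toy sketch turns a small syllable lattice into word arcs and picks the cheapest word sequence. It is not the paper's implementation: it replaces WFST composition with a brute-force match of each pronunciation against the lattice, uses a unigram LM, and treats all weights as costs (negative log probabilities). Every name and number in it (`transduce`, `best_path`, the syllables, the words, the boost value) is a hypothetical chosen for the example, and lattice states are assumed to be numbered in topological order.

```python
from math import inf


def _match(out, state, syllables, cost):
    """Yield (end_state, cost) for each lattice path spelling `syllables`."""
    if not syllables:
        yield state, cost
        return
    for dst, syl, arc_cost in out.get(state, ()):
        if syl == syllables[0]:
            yield from _match(out, dst, syllables[1:], cost + arc_cost)


def transduce(syl_arcs, lexicon, lm_cost, keywords=(), boost=0.0):
    """Map syllable arcs (src, dst, syllable, cost) to word arcs.

    Each word arc carries the acoustic cost of its syllables plus the
    unigram LM cost of the word; keywords get `boost` subtracted.
    """
    out = {}
    for src, dst, syl, cost in syl_arcs:
        out.setdefault(src, []).append((dst, syl, cost))

    word_arcs = {}
    for word, prons in lexicon.items():
        word_cost = lm_cost[word] - (boost if word in keywords else 0.0)
        for pron in prons:
            for start in out:
                for end, acoustic in _match(out, start, pron, 0.0):
                    key = (start, end, word)
                    total = acoustic + word_cost
                    if total < word_arcs.get(key, inf):
                        word_arcs[key] = total
    return sorted((s, e, w, c) for (s, e, w), c in word_arcs.items())


def best_path(word_arcs, start, final):
    """Cheapest word sequence from `start` to `final` (assumes src < dst)."""
    best = {start: (0.0, [])}
    for src, dst, word, cost in sorted(word_arcs):
        if src in best:
            cand = (best[src][0] + cost, best[src][1] + [word])
            if dst not in best or cand[0] < best[dst][0]:
                best[dst] = cand
    return best.get(final)


# Toy syllable lattice: two competing first syllables, then "to", then "ri".
SYL_ARCS = [(0, 1, "ka", 0.5), (0, 1, "ga", 0.25),
            (1, 2, "to", 0.25), (2, 3, "ri", 0.25)]
# Hypothetical lexicon; in the paper, OOV entries would come from G2S.
LEXICON = {"kato": [("ka", "to")], "gato": [("ga", "to")], "ri": [("ri",)]}
LM_COST = {"kato": 2.0, "gato": 2.0, "ri": 1.0}

plain = best_path(transduce(SYL_ARCS, LEXICON, LM_COST), 0, 3)
boosted = best_path(
    transduce(SYL_ARCS, LEXICON, LM_COST, keywords={"kato"}, boost=0.5), 0, 3)
print(plain)    # (3.75, ['gato', 'ri'])
print(boosted)  # (3.5, ['kato', 'ri'])
```

In this toy case the acoustically weaker keyword "kato" only surfaces on the best word path once its LM cost is lowered, which is the intuition behind boosting keywords before transduction.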
