Joint decoding of tandem and hybrid systems for improved keyword spotting on low resource languages

Copyright © 2015 ISCA. Keyword spotting (KWS) for low-resource languages has drawn increasing attention in recent years. The state-of-the-art KWS systems are based on lattices or Confusion Networks (CN) generated by Automatic Speech Recognition (ASR) systems. It has been shown that considerable KWS gains can be obtained by combining the keyword detection results from different forms of ASR systems, e.g., Tandem and Hybrid systems. This paper investigates an alternative combination scheme for KWS using joint decoding. This scheme treats a Tandem system and a Hybrid system as two separate streams, and makes a linear combination of individual acoustic model log-likelihoods. Joint decoding is more efficient as it requires just a single pass of decoding and a single pass of keyword search. Experiments on six Babel OP2 development languages show that joint decoding is capable of providing consistent gains over each individual system. Moreover, it is possible to efficiently rescore the joint decoding lattices with Tandem or Hybrid acoustic models, and further KWS gains can be obtained by merging the detection posting lists from the joint decoding lattices and rescored lattices.

[1]  Hervé Bourlard,et al.  New entropy based combination rules in HMM/ANN multi-stream ASR , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[2]  Lukás Burget,et al.  Sequence-discriminative training of deep neural networks , 2013, INTERSPEECH.

[3]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[4]  Mark J. F. Gales,et al.  Combining tandem and hybrid systems for improved speech recognition and keyword spotting on low resource languages , 2014, INTERSPEECH.

[5]  Andreas Stolcke,et al.  The SRI/OGI 2006 spoken term detection system , 2007, INTERSPEECH.

[6]  Jong-Hak Lee,et al.  Analyses of multiple evidence combination , 1997, SIGIR '97.

[7]  Hermann Ney,et al.  Multilingual MRASTA features for low-resource keyword search and speech recognition systems , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Jonathan G. Fiscus,et al.  Results of the 2006 Spoken Term Detection Evaluation , 2006 .

[9]  Herbert Gish,et al.  Rapid and accurate spoken term detection , 2007, INTERSPEECH.

[10]  Mark J. F. Gales,et al.  Semi-tied covariance matrices for hidden Markov models , 1999, IEEE Trans. Speech Audio Process..

[11]  Richard M. Schwartz,et al.  Discriminative semi-supervised training for keyword search in low resource languages , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[12]  Bin Ma,et al.  Low-resource keyword search strategies for tamil , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Mark J. F. Gales,et al.  The efficient incorporation of MLP features into automatic speech recognition systems , 2011, Comput. Speech Lang..

[14]  Tara N. Sainath,et al.  Joint training of convolutional and non-convolutional neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Cyril Allauzen,et al.  General Indexation of Weighted Automata - Application to Spoken Utterance Retrieval , 2004, HLT-NAACL 2004.

[16]  George Saon,et al.  Feature space Gaussianization , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[17]  Gunnar Evermann,et al.  Posterior probability decoding, confidence estimation and system combination , 2000 .

[18]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[19]  Mark J. F. Gales,et al.  Language independent and unsupervised acoustic models for speech recognition and keyword spotting , 2014, INTERSPEECH.

[20]  Xavier L. Aubert,et al.  Combining TDNN and HMM in a hybrid system for improved continuous-speech recognition , 1994, IEEE Trans. Speech Audio Process..

[21]  Steve Renals,et al.  Revisiting hybrid and GMM-HMM system combination techniques , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[22]  Brian Kingsbury,et al.  Exploiting diversity for spoken term detection , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[23]  Jonathan G. Fiscus,et al.  A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER) , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[24]  Jean-Luc Gauvain,et al.  Lightly supervised and unsupervised acoustic model training , 2002, Comput. Speech Lang..

[25]  Mark J. F. Gales,et al.  Data augmentation for low resource languages , 2014, INTERSPEECH.

[26]  Chao Zhang,et al.  A general artificial neural network extension for HTK , 2015, INTERSPEECH.

[27]  Ralf Schlüter,et al.  Investigation on cross- and multilingual MLP features under matched and mismatched acoustical conditions , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[28]  Kenneth Ward Church,et al.  Deep neural network features and semi-supervised training for low resource speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[29]  Daniel Povey,et al.  Minimum Phone Error and I-smoothing for improved discriminative training , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[30]  Hervé Glotin,et al.  Multi-stream adaptive evidence combination for noise robust ASR , 2001, Speech Commun..

[31]  Kristina Toutanova,et al.  Joint Optimization for Machine Translation System Combination , 2009, EMNLP.

[32]  Richard M. Schwartz,et al.  A compact model for speaker-adaptive training , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[33]  Geoffrey E. Hinton,et al.  Acoustic Modeling Using Deep Belief Networks , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[34]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[35]  Richard M. Schwartz,et al.  Score normalization and system combination for improved keyword spotting , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[36]  Hervé Bourlard,et al.  Connectionist probability estimators in HMM speech recognition , 1994, IEEE Trans. Speech Audio Process..

[37]  Xiaodong Cui,et al.  A high-performance Cantonese keyword search system , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[38]  Xiaodong Cui,et al.  System combination and score normalization for spoken term detection , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[39]  Lukás Burget,et al.  Sub-word modeling of out of vocabulary words in spoken term detection , 2008, 2008 IEEE Spoken Language Technology Workshop.

[40]  Mark J. F. Gales,et al.  Unicode-based graphemic systems for limited resource languages , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[41]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..