The THUEE system for the OpenKWS14 keyword search evaluation

The OpenKWS14 keyword search evaluation is one of the most challenging and influential evaluations in the field of speech recognition. Its goal is to build a high-performance keyword search system for a minority language with limited training data within a short period of time. We present the system of the Department of Electronic Engineering, Tsinghua University (THUEE team) for the OpenKWS14 keyword search evaluation. The highlights of the system include the use of convolutional maxout neural networks for acoustic modeling and the use of neural network language models for one-pass lattice generation. The final system is a fusion of eight subsystems. It achieved an actual term-weighted value (ATWV) of 0.5107 under the full language pack (FullLP) condition in the evaluation, ranking third among the participating teams.
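The maxout activation underlying the convolutional maxout acoustic models mentioned above simply takes the maximum over small groups of linear units, which makes the nonlinearity itself learnable. The sketch below is an illustrative assumption of how such an activation is commonly implemented; the function name, shapes, and grouping scheme are not taken from the authors' system.

```python
import numpy as np

def maxout(z, group_size):
    """Maxout activation: each consecutive block of `group_size`
    pre-activation units forms one group, and the output of the group
    is the maximum unit value.

    z: array of shape (batch, units), with units divisible by group_size.
    Returns an array of shape (batch, units // group_size).
    """
    batch, units = z.shape
    assert units % group_size == 0, "units must divide evenly into groups"
    # Reshape so the last axis enumerates the units within each group,
    # then reduce that axis with max.
    return z.reshape(batch, units // group_size, group_size).max(axis=2)

# Toy example: 4 pre-activations collapsed into 2 maxout outputs.
x = np.array([[1.0, -2.0, 0.5, 3.0]])
print(maxout(x, 2))  # [[1. 3.]]
```

In a convolutional maxout layer the same pooling-over-units is applied to convolution feature maps rather than to a plain affine layer, but the group-wise max is identical.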
