Constructing Acoustic Distances Between Subwords and States Obtained from a Deep Neural Network for Spoken Term Detection

Detecting out-of-vocabulary (OOV) query terms is a crucial problem in spoken term detection (STD), because query terms frequently contain words absent from the recognizer's vocabulary. To enable search for OOV query terms, an STD system compares a query subword sequence with the subword sequences produced by running an automatic speech recognizer over the spoken documents. When comparing two subword sequences, the edit distance, which assigns a uniform cost to any pair of mismatched subwords, is typically used. We previously proposed an acoustic distance defined from statistics of hidden Markov model (HMM) states and showed its effectiveness in STD [4]. This paper proposes an acoustic distance between subwords and between HMM states in which the posterior probabilities output by a deep neural network (DNN) are used to improve STD accuracy for OOV query terms. Experiments evaluating the proposed method are conducted on the open test collections of the "SpokenDoc" tasks of the NTCIR-9 [13] and NTCIR-10 [14] workshops. The proposed method improves the mean average precision.
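To illustrate how a subword-level acoustic distance can replace the uniform substitution cost of the plain edit distance, the sketch below represents each subword by an averaged posterior distribution over HMM states and uses a Bhattacharyya distance between those distributions as the substitution cost. The subword names, the toy posteriors, and the choice of Bhattacharyya distance are illustrative assumptions, not the paper's exact formulation.

```python
import math

def bhattacharyya_distance(p, q):
    """Distance between two discrete posterior distributions over HMM states.

    Identical distributions give a distance of 0; the distance grows as the
    distributions overlap less.
    """
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(max(bc, 1e-12))  # clamp to avoid log(0)

def acoustic_edit_distance(query, doc, posteriors, ins_del_cost=1.0):
    """Dynamic-programming edit distance whose substitution cost is an
    acoustic distance between subwords instead of a flat 0/1 cost."""
    n, m = len(query), len(doc)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * ins_del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_del_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = bhattacharyya_distance(posteriors[query[i - 1]],
                                         posteriors[doc[j - 1]])
            d[i][j] = min(d[i - 1][j] + ins_del_cost,      # deletion
                          d[i][j - 1] + ins_del_cost,      # insertion
                          d[i - 1][j - 1] + sub)           # substitution
    return d[n][m]

# Toy averaged DNN posteriors over three HMM states for three subwords
# (hypothetical values for illustration only).
post = {"a": [0.8, 0.1, 0.1], "i": [0.1, 0.8, 0.1], "u": [0.2, 0.2, 0.6]}
print(acoustic_edit_distance(["a", "i"], ["a", "u"], post))
```

Because acoustically similar subwords get a small substitution cost, a query subword sequence can match a slightly misrecognized document sequence at low total cost, which is exactly what helps OOV query terms.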

[1]  Tara N. Sainath, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition, 2012.

[2]  Brian Kingsbury, et al. Exploiting diversity for spoken term detection, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[3]  Florian Metze, et al. EM-based phoneme confusion matrix generation for low-resource spoken term detection, 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[4]  Stephen J. Cox, et al. Modelling confusion matrices to improve speech recognition accuracy, with an application to dysarthric speech, 2007, INTERSPEECH.

[5]  Jonathan G. Fiscus, et al. Automatic Language Model Adaptation for Spoken Document Retrieval, 2000, RIAO.

[6]  Yoshiaki Itoh, et al. An STD System for OOV Query Terms Integrating Multiple STD Results of Various Subword Units, 2013, NTCIR.

[7]  Hiromitsu Nishizaki, et al. Spoken Term Detection Using Multiple Speech Recognizers' Outputs at NTCIR-9 SpokenDoc STD Subtask, 2011, NTCIR.

[8]  Yonghong Yan, et al. Keyword Spotting Based on Phoneme Confusion Matrix, 2006.

[9]  Tatsuya Kawahara, et al. Overview of the IR for Spoken Documents Task in NTCIR-9 Workshop, 2011, NTCIR.

[10]  Tatsuya Kawahara, et al. Overview of the NTCIR-10 SpokenDoc-2 Task, 2013, NTCIR.

[11]  Katunobu Itou, et al. Evaluating Speech-Driven IR in the NTCIR-3 Web Retrieval Task, 2002, NTCIR.

[12]  Tatsuya Kawahara, et al. Recent Development of Open-Source Speech Recognition Engine Julius, 2009.

[13]  Nobuaki Minematsu, et al. Divergence estimation based on deep neural networks and its use for language identification, 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Xiaodong Cui, et al. A high-performance Cantonese keyword search system, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[15]  Yee Whye Teh, et al. A Fast Learning Algorithm for Deep Belief Nets, 2006, Neural Computation.

[16]  Fabio Valente, et al. English spoken term detection in multilingual recordings, 2010, INTERSPEECH.

[17]  Shi-wook Lee, et al. Open-vocabulary spoken document retrieval based on new subword models and subword phonetic similarity, 2006, INTERSPEECH.

[18]  K. Maekawa. Corpus of Spontaneous Japanese: Its Design and Evaluation, 2003.

[19]  Thomas Sikora, et al. Phonetic confusion based document expansion for spoken document retrieval, 2004, INTERSPEECH.