Rescoring by a deep neural network for spoken term detection

In spoken term detection (STD), detecting out-of-vocabulary (OOV) query terms is crucial because query terms are likely to be OOV terms. This paper proposes a rescoring method that uses the posterior probabilities output by a deep neural network (DNN) to improve detection accuracy for OOV query terms. Conventional STD methods for OOV query terms search the subword sequences of the speech data, obtained with an automatic speech recognizer, for the subword sequence of the query. The proposed method performs a more detailed matching using the probabilities output by the DNN. Because these probabilities are obtained at the frame level, a pseudo query at the frame or state level is generated so that it can be aligned with them. To reduce the computational burden of the DNN, the proposed method is applied only to the top candidate utterances, which can be found quickly by a conventional STD method. Experiments were conducted using the open test collections for the SpokenDoc tasks of the NTCIR-9 and NTCIR-10 workshops as benchmarks. The proposed method improved the mean average precision by 5 to 20 points, surpassing the best accuracy obtained at the workshops. These results demonstrate the effectiveness of the proposed method.
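The two-pass idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the matching algorithm, so the sketch assumes a simple subsequence dynamic-programming alignment over negative log posteriors. The names `rescore_utterance`, `rerank_top_candidates`, `get_posteriors`, and `top_n` are hypothetical, and `get_posteriors` stands in for a DNN forward pass that returns one posterior vector per frame.

```python
import math


def rescore_utterance(posteriors, pseudo_query, floor=1e-10):
    """Align a state-level pseudo query against frame-level posteriors.

    posteriors:   list of frames, each a list of state posterior probabilities
    pseudo_query: list of state indices (the query expanded to the state level)
    Returns the minimum accumulated -log posterior over all frame spans
    (lower is better). Assumption: each query state covers at least one frame
    and states are visited in order.
    """
    n_states = len(pseudo_query)
    prev = [math.inf] * n_states
    best = math.inf
    for frame in posteriors:
        cur = [math.inf] * n_states
        for j, state in enumerate(pseudo_query):
            local = -math.log(max(frame[state], floor))
            if j == 0:
                # Free start: the match may begin at any frame. Local costs are
                # non-negative, so starting fresh is never worse than staying.
                enter = 0.0
            else:
                # Stay in the same query state or advance from the previous one.
                enter = min(prev[j], prev[j - 1])
            cur[j] = enter + local
        # Free end: the match may finish at any frame.
        best = min(best, cur[-1])
        prev = cur
    return best


def rerank_top_candidates(candidates, pseudo_query, get_posteriors, top_n):
    """Rescore only the top_n first-pass candidates; keep the rest in order.

    candidates: list of (utterance_id, first_pass_score), higher score is better
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    head, tail = ranked[:top_n], ranked[top_n:]
    rescored = sorted(
        head, key=lambda c: rescore_utterance(get_posteriors(c[0]), pseudo_query)
    )
    return [utt_id for utt_id, _ in rescored + tail]


# Toy data: three DNN output states, a two-state pseudo query.
toy_posteriors = {
    "a": [[0.1, 0.1, 0.8], [0.9, 0.05, 0.05], [0.05, 0.9, 0.05]],
    "b": [[0.3, 0.3, 0.4], [0.3, 0.3, 0.4]],
}
first_pass = [("b", 0.9), ("a", 0.8), ("c", 0.1)]
reranked = rerank_top_candidates(first_pass, [0, 1], toy_posteriors.__getitem__, 2)
print(reranked)
```

In this toy example only utterances "a" and "b" are passed through the stand-in for the DNN; "c" keeps its first-pass position, which mirrors the cost-saving step of restricting rescoring to the top candidates. The accumulated cost is not length-normalized here, a simplification that a real system would likely revisit.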
