System request detection in human conversation based on multi-resolution Gabor wavelet features

For a hands-free speech interface, it is important to detect commands within spontaneous utterances. Conventional voice activity detection (VAD) systems can only distinguish speech frames from non-speech frames; they cannot determine whether a detected speech segment is a command directed at the system. In this paper, to analyze the difference between system requests and spontaneous utterances, we focus on long-period fluctuations, such as prosodic articulation, and short-period fluctuations, such as phoneme articulation. Multi-resolution analysis using Gabor wavelets on a log-scale mel-frequency filter-bank clarifies the differing characteristics of system commands and spontaneous utterances. Experiments on our robot dialog corpus show that the proposed method achieves an F-measure of 92.6%, while a conventional power- and prosody-based method achieves only 66.7%.

Index Terms: dialog system, voice activity detection, system request detection
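To make the feature pipeline concrete, below is a minimal sketch of multi-resolution Gabor analysis over a log-scale mel-frequency filter-bank, assuming librosa and scipy are available. The kernel sizes, modulation frequencies, and time-pooling are illustrative assumptions; the abstract does not specify the paper's exact Gabor parameterization or downstream classifier.

```python
# Sketch: multi-resolution Gabor features on a log-scale mel filter-bank.
# Coarse filters respond to slow (prosodic) spectro-temporal modulation,
# fine filters to fast (phoneme-level) modulation. Parameters below are
# illustrative, not the paper's exact configuration.
import numpy as np
import librosa
from scipy.signal import convolve2d

def gabor_kernel_2d(omega_t, omega_f, sigma_t, sigma_f, size=15):
    """Complex 2-D Gabor kernel over the (frequency, time) plane."""
    half = size // 2
    t, f = np.meshgrid(np.arange(-half, half + 1),
                       np.arange(-half, half + 1))
    envelope = np.exp(-(t**2) / (2 * sigma_t**2)
                      - (f**2) / (2 * sigma_f**2))
    carrier = np.exp(1j * (omega_t * t + omega_f * f))
    return envelope * carrier

def multires_gabor_features(wav_path, sr=16000, n_mels=40):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 1e-10)            # log-scale mel filter-bank

    # (omega_t, omega_f, sigma_t, sigma_f): one coarse and one fine scale.
    params = [(0.2, 0.2, 6.0, 6.0),          # long-period fluctuation
              (1.0, 1.0, 1.5, 1.5)]          # short-period fluctuation
    feats = []
    for omega_t, omega_f, sigma_t, sigma_f in params:
        kernel = gabor_kernel_2d(omega_t, omega_f, sigma_t, sigma_f)
        resp = np.abs(convolve2d(log_mel, kernel, mode='same'))
        feats.append(resp.mean(axis=1))      # pool over time per mel band
    return np.concatenate(feats)
```

The pooled vector could then feed a simple classifier (e.g., an SVM) to separate system requests from chat; pooling the filter magnitudes over time is one plausible choice for turning the 2-D responses into a fixed-length utterance feature.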
