Weighted Finite State Transducer-Based Endpoint Detection Using Probabilistic Decision Logic

In this paper, we propose data-driven probabilistic utterance-level decision logic to improve Weighted Finite State Transducer (WFST)-based endpoint detection. Endpoint detection is generally handled by two cascaded decision processes. The first is frame-level speech/non-speech classification based on statistical hypothesis testing, and the second is an utterance-level speech boundary decision based on heuristic knowledge. To handle these two processes within a unified framework, we propose a WFST-based approach. However, such a WFST-based approach shares the limitations of conventional approaches: the utterance-level decision is still based on heuristic knowledge, and the decision parameters are tuned sequentially. Therefore, to obtain the decision knowledge from a speech corpus and optimize the parameters jointly, we propose the use of data-driven probabilistic utterance-level decision logic. The proposed method reduces the average detection failure rate by about 14% on various noisy-speech corpora collected for endpoint detection evaluation.
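
To make the conventional two-stage cascade concrete, the following minimal sketch illustrates it under simplifying assumptions. It is not the method proposed in this paper: the frame-level stage uses a plain energy threshold as a stand-in for statistical hypothesis testing, and the function names, the threshold, and the frame-count parameters (`classify_frames`, `detect_endpoints`, `threshold_db`, `min_speech_frames`, `min_silence_frames`) are hypothetical values chosen for illustration.

```python
from typing import List, Optional, Tuple


def classify_frames(energies_db: List[float], threshold_db: float = -30.0) -> List[bool]:
    """Stage 1: frame-level speech/non-speech decision.

    Assumption: a fixed energy threshold stands in for a statistical
    hypothesis test such as a likelihood ratio test.
    """
    return [e > threshold_db for e in energies_db]


def detect_endpoints(
    flags: List[bool],
    min_speech_frames: int = 3,
    min_silence_frames: int = 5,
) -> Tuple[Optional[int], Optional[int]]:
    """Stage 2: heuristic utterance-level boundary decision.

    The start is declared once `min_speech_frames` consecutive speech frames
    are seen; the end is declared once `min_silence_frames` consecutive
    non-speech frames follow. Returns (start, end), where `end` is the index
    of the first frame of the closing silence run, or None if not found.
    """
    start: Optional[int] = None
    speech_run = 0
    silence_run = 0
    for i, is_speech in enumerate(flags):
        if start is None:
            # Still waiting for a sufficiently long run of speech frames.
            speech_run = speech_run + 1 if is_speech else 0
            if speech_run >= min_speech_frames:
                start = i - min_speech_frames + 1
        else:
            # Inside an utterance: count trailing non-speech frames.
            silence_run = 0 if is_speech else silence_run + 1
            if silence_run >= min_silence_frames:
                return start, i - min_silence_frames + 1
    return start, None


if __name__ == "__main__":
    # Hypothetical frame energies: silence, speech, silence.
    energies = [-50.0] * 4 + [-10.0] * 6 + [-50.0] * 6
    print(detect_endpoints(classify_frames(energies)))
```

In this sketch the two frame-count parameters are the kind of hand-set heuristic knowledge that would typically be tuned separately from the frame-level threshold, which is the limitation the abstract describes.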
