Using online model comparison in the Variational Bayes framework for online unsupervised Voice Activity Detection

This paper presents the use of the online Variational Bayes method for online Voice Activity Detection (VAD) in an unsupervised context. In conventional VAD, the final decision step often relies on state machines whose parameters are tuned heuristically. The goal of this study is to propose a sound statistical scheme for VAD based on online model comparison, which the Variational Bayes framework provides. In this scheme, two models are estimated online in parallel: one for the noise-only situation and the other for the noise-plus-signal situation. The VAD decision is then made automatically according to which model is selected. An experimental evaluation on the CENSREC-1-C database shows a significant improvement of the proposed method over conventional statistical VAD methods.
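The two-model scheme can be illustrated with a minimal sketch. It is not the paper's algorithm: instead of the Variational Bayes free energy it compares a discounted predictive log-likelihood, it uses hard-assignment updates with exponential forgetting, and a floor on the mixing weight stands in crudely for a complexity penalty. The feature is assumed to be a per-frame log-energy, the stream is assumed to start with noise, and every name and parameter below (`Gaussian`, `OnlineModelComparisonVAD`, `detect`, `gap`, `rate`, `score_rate`, `w_floor`) is hypothetical.

```python
import math


class Gaussian:
    """1-D Gaussian whose mean and variance are tracked with exponential forgetting."""

    def __init__(self, mean, var=1.0, rate=0.05, var_floor=1e-2):
        self.mean, self.var = mean, var
        self.rate, self.var_floor = rate, var_floor

    def log_pdf(self, x):
        return -0.5 * (math.log(2 * math.pi * self.var)
                       + (x - self.mean) ** 2 / self.var)

    def update(self, x):
        d = x - self.mean
        self.mean += self.rate * d
        # Exponentially weighted variance, floored to stay positive.
        self.var = max((1 - self.rate) * (self.var + self.rate * d * d),
                       self.var_floor)


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


class OnlineModelComparisonVAD:
    """Sketch: noise-only model vs. noise-plus-signal model, compared online."""

    def __init__(self, first_frame, gap=3.0, rate=0.05,
                 score_rate=0.1, w_floor=0.1):
        # Model A (noise only): a single Gaussian.
        self.single = Gaussian(first_frame, rate=rate)
        # Model B (noise plus signal): two Gaussians; the speech component
        # starts `gap` above the first frame (an arbitrary assumption).
        self.noise = Gaussian(first_frame, rate=rate)
        self.speech = Gaussian(first_frame + gap, rate=rate)
        self.w_speech = 0.5
        self.rate, self.score_rate, self.w_floor = rate, score_rate, w_floor
        self.score_single = 0.0  # discounted predictive log-likelihoods
        self.score_mix = 0.0

    def step(self, x):
        """Process one frame and return True if it is labelled as speech."""
        lp_single = self.single.log_pdf(x)
        a = math.log(1 - self.w_speech) + self.noise.log_pdf(x)
        b = math.log(self.w_speech) + self.speech.log_pdf(x)
        lp_mix = log_add(a, b)

        # Score both models before they see the frame (predictive score).
        self.score_single += self.score_rate * (lp_single - self.score_single)
        self.score_mix += self.score_rate * (lp_mix - self.score_mix)

        # Speech requires that the two-component model is selected AND that
        # the frame is assigned to its speech component.
        from_speech = b > a
        decision = self.score_mix > self.score_single and from_speech

        # Update both models in parallel.
        self.single.update(x)
        (self.speech if from_speech else self.noise).update(x)
        w = self.w_speech + self.rate * ((1.0 if from_speech else 0.0) - self.w_speech)
        self.w_speech = min(max(w, self.w_floor), 1 - self.w_floor)
        return decision


def detect(frames, **kwargs):
    """Run the sketch over a sequence of per-frame log-energies."""
    vad = OnlineModelComparisonVAD(frames[0], **kwargs)
    return [vad.step(x) for x in frames]
```

Calling `detect(frames)` on a list of log-energies returns one boolean per frame. No state machine or hand-tuned threshold sits on the decision itself, although the forgetting rates and the weight floor are still free parameters in this simplified version.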
