Multi-task joint-learning for robust voice activity detection

Model-based VAD approaches have been widely used and have achieved success in practice. These approaches usually cast VAD as a frame-level classification problem and employ statistical classifiers, such as the Gaussian Mixture Model (GMM) or Deep Neural Network (DNN), to assign a speech/silence label to each frame. Due to the frame-independence assumption of the classification, the VAD results tend to be fragile. To address this problem, this paper proposes a new structured multi-frame prediction DNN approach to improve segment-level VAD performance. During DNN training, the VAD labels of multiple consecutive frames are concatenated together as targets and jointly trained with a speech enhancement task to achieve robustness under noisy conditions. During testing, the VAD label for each frame is obtained by merging the prediction results from neighbouring frames. Experiments on the Aurora 4 dataset showed that conventional DNN-based VAD gives poor and unstable prediction performance, whereas the proposed multi-task trained VAD is much more robust.
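
The two core ideas, structured multi-frame VAD targets trained jointly with a speech-enhancement head and test-time merging of overlapping per-frame predictions, can be sketched as below. This is a minimal illustration under assumed settings, not the authors' implementation: the hidden sizes, context width, prediction span N_PRED, and loss weight alpha are all placeholders chosen for the example.

```python
# A minimal sketch (not the paper's code) of (1) a shared DNN with a
# multi-frame VAD head plus a speech-enhancement head, and (2) merging
# of overlapping per-frame predictions at test time.
# CONTEXT, N_PRED, FEAT_DIM, hidden sizes, and alpha are assumptions.
import torch
import torch.nn as nn

CONTEXT = 11   # input context window in frames (assumed)
N_PRED = 5     # consecutive VAD labels predicted per step (assumed)
FEAT_DIM = 40  # filterbank feature dimension (assumed)

class MultiTaskVAD(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(CONTEXT * FEAT_DIM, 1024), nn.Sigmoid(),
            nn.Linear(1024, 1024), nn.Sigmoid(),
        )
        # Head 1: speech/silence posteriors for N_PRED consecutive frames.
        self.vad_head = nn.Linear(1024, N_PRED)
        # Head 2: clean-feature regression (the speech-enhancement task).
        self.enh_head = nn.Linear(1024, FEAT_DIM)

    def forward(self, x):
        h = self.shared(x)
        return torch.sigmoid(self.vad_head(h)), self.enh_head(h)

def joint_loss(vad_post, vad_tgt, enh_out, clean_tgt, alpha=0.5):
    # Cross-entropy over the structured multi-frame VAD targets plus an
    # MSE enhancement loss; alpha (assumed) balances the two tasks.
    bce = nn.functional.binary_cross_entropy(vad_post, vad_tgt)
    mse = nn.functional.mse_loss(enh_out, clean_tgt)
    return bce + alpha * mse

def merge_predictions(posteriors):
    # posteriors: (T, N_PRED) tensor; row t holds predictions for frames
    # t .. t+N_PRED-1. Averaging the overlapping predictions gives each
    # frame a score pooled from up to N_PRED neighbouring outputs.
    T = posteriors.shape[0]
    scores = torch.zeros(T + N_PRED - 1)
    counts = torch.zeros(T + N_PRED - 1)
    for t in range(T):
        scores[t:t + N_PRED] += posteriors[t]
        counts[t:t + N_PRED] += 1
    return (scores / counts)[:T]  # per-frame speech probability
```

The averaging step is one simple way to realise the "merging the prediction results from neighbouring frames" described above; other combination rules (e.g. a product of posteriors) would fit the same structure.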
