Paper Information - Two-step Judgment Algorithm for Robust Voice Activity Detection Based on Deep Neural Networks


Voice Activity Detection (VAD) is an important front-end process for speech-based applications such as automatic speech recognition (ASR) and speaker diarization. VAD attempts to identify all the segments of an audio signal that contain speech. In this paper, a robust VAD system is developed by fusing a deep neural network (DNN) with Combo-SAD. The DNN model is an effective supervised approach that can achieve a missed detection rate (Pmiss) of 4% at a false-alarm rate (Pfa) of 5%. Combo-SAD is an unsupervised approach designed for noise robustness, with a reported Pmiss of 5% at a Pfa of 3%. To combine the advantages of both techniques, this paper proposes a 2-step judgment approach. Experimental results on a database containing various types of audio show an overall error rate of 13.50%, which indicates that the proposed VAD system is robust and effective.
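The abstract quotes Pmiss and Pfa without defining them. The sketch below shows one common frame-level way to compute the two rates, assuming binary per-frame labels (1 = speech, 0 = non-speech). The function name `vad_error_rates` and the toy label sequences are hypothetical and used only for illustration; the paper's own scoring protocol, and its definition of the overall error rate, may differ.

```python
from typing import Sequence, Tuple


def vad_error_rates(reference: Sequence[int],
                    hypothesis: Sequence[int]) -> Tuple[float, float]:
    """Return (p_miss, p_fa) for frame-level labels, where 1 = speech.

    Assumed definitions (common in VAD scoring, not taken from the paper):
      p_miss = speech frames labelled non-speech / all speech frames
      p_fa   = non-speech frames labelled speech / all non-speech frames
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    speech = sum(1 for r in reference if r == 1)
    nonspeech = len(reference) - speech

    missed = sum(1 for r, h in zip(reference, hypothesis) if r == 1 and h == 0)
    false_alarms = sum(1 for r, h in zip(reference, hypothesis) if r == 0 and h == 1)

    # Guard against empty classes so the function never divides by zero.
    p_miss = missed / speech if speech else 0.0
    p_fa = false_alarms / nonspeech if nonspeech else 0.0
    return p_miss, p_fa


# Toy example: 4 speech frames followed by 6 non-speech frames.
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
hypothesis = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
p_miss, p_fa = vad_error_rates(reference, hypothesis)
print(f"Pmiss = {p_miss:.2%}, Pfa = {p_fa:.2%}")
```

In the toy example, one of the four speech frames is missed and one of the six non-speech frames is a false alarm. Because the two rates trade off against each other as the decision threshold moves, a Pmiss figure is only meaningful when quoted at a stated Pfa, as the abstract does.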

Haonan Wang
