Two-step Judgment Algorithm for Robust Voice Activity Detection Based on Deep Neural Networks

Voice Activity Detection (VAD) is an important front-end process for speech-based applications such as automatic speech recognition (ASR) and speaker diarization. VAD attempts to identify all the segments of an audio signal that contain speech. In this paper, a robust VAD system is developed by fusing a deep neural network (DNN) with Combo-SAD. The DNN model is an effective supervised approach that can achieve a missed-detection rate (Pmiss) of 4% at a false-alarm rate (Pfa) of 5%. Combo-SAD is an unsupervised approach designed for noise robustness, with a reported Pmiss of 5% at a Pfa of 3%. To combine the advantages of both techniques, this paper designs a two-step judgment approach. Experimental results on a database containing various types of audio show that the overall error rate reaches 13.50%, which indicates that the proposed VAD system is robust and effective.
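The two rates quoted above can be made concrete with a short sketch. It assumes the usual frame-level definitions, which the abstract does not spell out: Pmiss is the fraction of reference speech frames labelled as non-speech, and Pfa is the fraction of reference non-speech frames labelled as speech. The function name `vad_error_rates` and the toy labels are illustrative only and are not taken from the paper.

```python
def vad_error_rates(reference, hypothesis):
    """Return (p_miss, p_fa) for frame-level labels (1 = speech, 0 = non-speech).

    Assumed definitions (not stated in the abstract):
      p_miss = missed speech frames / reference speech frames
      p_fa   = false-alarm frames   / reference non-speech frames
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    n_speech = sum(1 for r in reference if r)
    n_nonspeech = len(reference) - n_speech

    # Speech frames the detector labelled as non-speech.
    missed = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    # Non-speech frames the detector labelled as speech.
    false_alarms = sum(1 for r, h in zip(reference, hypothesis) if not r and h)

    p_miss = missed / n_speech if n_speech else 0.0
    p_fa = false_alarms / n_nonspeech if n_nonspeech else 0.0
    return p_miss, p_fa


# Toy example: 4 speech frames followed by 6 non-speech frames.
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
hypothesis = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
p_miss, p_fa = vad_error_rates(reference, hypothesis)
print(f"Pmiss = {p_miss:.2%}, Pfa = {p_fa:.2%}")
```

Because the two rates are normalised by different frame counts, they cannot simply be added to obtain an overall error rate. How the 13.50% figure is computed is not specified in the abstract.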
