Unsupervised and supervised VAD systems using combination of time and frequency domain features

Abstract Voice Activity Detection (VAD), also referred to as Speech Activity Detection (SAD), is the process of identifying speech and non-speech regions in digital speech recordings. It is used as a preliminary stage to reduce errors and increase effectiveness in most speech-based applications, such as automatic speech recognition (ASR), speaker identification/verification, speech enhancement, and speaker diarization. In this study, two independent VAD structures were proposed, one unsupervised and one supervised, both using time and frequency domain features. In the supervised approach, autocorrelation-based pitch contour estimation was used together with a 1NN cosine classifier to obtain the VAD decision. The classifier was trained on a 21-column feature matrix comprising energy, zero crossing rate (ZCR), 13th-order Mel Frequency Cepstral Coefficients (MFCC), and Shannon entropies of a depth-5 Wavelet Packet Transform (WPT) with Daubechies filters. In the unsupervised approach, normalization, thresholding, and median filtering were applied to the same feature set. The proposed unsupervised VAD achieved error rates of 4%, 19%, 0.02% and 0.7% for FEC, MSC, OVER and NDS, respectively, at 0 dB SNR. The VAD decisions of both the supervised and unsupervised systems showed that the proposed methods can be used efficiently either in silent environments or in environments with noise similar to Additive White Gaussian Noise (AWGN).
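The two decision stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the frame length, hop size, threshold, median filter width, and all function names below are assumptions chosen for illustration, and only energy and ZCR are computed (the MFCC, WPT entropy, and pitch contour features are omitted). The unsupervised path normalizes a feature, thresholds it, and smooths the decision with a median filter; the supervised path assigns each frame the label of its nearest training frame under cosine similarity.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def energy_zcr(frames):
    """Short-time energy and zero crossing rate for each frame."""
    energy = np.sum(frames ** 2, axis=1)
    signs = np.signbit(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    return energy, zcr

def median_filter(v, k):
    """Sliding median of odd width k, with edge padding."""
    padded = np.pad(v, k // 2, mode="edge")
    idx = np.arange(len(v))[:, None] + np.arange(k)[None, :]
    return np.median(padded[idx], axis=1)

def unsupervised_vad(x, frame_len=400, hop=160, threshold=0.1, k=5):
    """Normalize, threshold, then median-filter (energy only, for brevity).

    All default values are illustrative assumptions, not the paper's settings.
    """
    energy, _ = energy_zcr(frame_signal(x, frame_len, hop))
    norm = (energy - energy.min()) / (energy.max() - energy.min() + 1e-12)
    raw = (norm > threshold).astype(float)
    return median_filter(raw, k) > 0.5

def nn1_cosine(train_X, train_y, test_X):
    """Label each test row with the label of its most cosine-similar train row."""
    a = train_X / (np.linalg.norm(train_X, axis=1, keepdims=True) + 1e-12)
    b = test_X / (np.linalg.norm(test_X, axis=1, keepdims=True) + 1e-12)
    return train_y[np.argmax(b @ a.T, axis=1)]
```

In a fuller version, each row passed to `nn1_cosine` would be the 21-column feature vector of one frame, and the unsupervised path would combine decisions over several features rather than energy alone.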
