Voice activity detection in presence of transients using the scattering transform

Voice activity detection in the presence of highly non-stationary noise and transient interferences remains an open problem. State-of-the-art voice activity detectors based on statistical models usually assume that the noise varies slowly with respect to speech. This assumption does not hold for transient interferences, which are short-time interruptions, so the performance of these detectors deteriorates significantly in their presence. In this paper, we propose a supervised learning algorithm for voice activity detection that is designed to perform in the presence of transients. We consider a labeled training set comprising speech, background noise, and transients, and propose a continuous measure of voice activity based on the Support Vector Machine (SVM) classifier. The measure is constructed in a feature domain, where the features are based on the scattering transform, incorporate noise estimation, and are designed to separate speech frames from non-speech frames. Experimental results demonstrate that the proposed algorithm outperforms state-of-the-art detectors for different types of background noise and, in particular, accurately classifies frames that contain transient interferences.
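
To make the idea of a continuous SVM-based measure concrete, the following is a minimal sketch and not the paper's implementation. It assumes synthetic two-dimensional feature vectors as stand-ins for the scattering-transform features, trains a linear SVM with plain stochastic subgradient descent on the hinge loss, and uses the signed decision value as the continuous measure. All function names, parameters, and data here are hypothetical.

```python
import random


def make_frames(n_per_class, seed=0):
    """Generate toy labeled frames (assumption: 2-D synthetic features).

    Label +1 stands for speech frames; label -1 stands for non-speech
    frames (background noise or transients).
    """
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n_per_class):
        features.append([2.0 + rng.uniform(-0.5, 0.5),
                         2.0 + rng.uniform(-0.5, 0.5)])
        labels.append(+1)
        features.append([-2.0 + rng.uniform(-0.5, 0.5),
                         -2.0 + rng.uniform(-0.5, 0.5)])
        labels.append(-1)
    return features, labels


def train_linear_svm(features, labels, epochs=100, lr=0.01, lam=0.01, seed=0):
    """Fit a linear SVM by stochastic subgradient descent on the hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    b = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # L2 regularization shrinks the weights at every step.
            w = [(1.0 - lr * lam) * wj for wj in w]
            # The hinge loss contributes only when the margin is violated.
            if margin < 1.0:
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def voice_activity_measure(w, b, x):
    """Continuous measure: signed SVM decision value for one frame."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


features, labels = make_frames(20)
w, b = train_linear_svm(features, labels)
speech_score = voice_activity_measure(w, b, [2.0, 2.0])
nonspeech_score = voice_activity_measure(w, b, [-2.0, -2.0])
print("speech-like frame:", speech_score)
print("non-speech-like frame:", nonspeech_score)
```

Larger positive values indicate frames deeper inside the speech region, and thresholding the value (at zero in this sketch) gives a hard speech/non-speech decision. Keeping the value continuous lets the threshold be tuned to trade off missed speech against false alarms.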
