Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets

We address voice activity detection in acoustic environments that contain both transient and stationary noise, which are common in real-life scenarios. We exploit the distinct spatial patterns of speech and non-speech audio frames by independently learning their underlying geometric structures. This is done with a deep encoder-decoder neural network architecture. The encoder maps spectral features with temporal information to low-dimensional representations, which are generated by applying the diffusion maps method. The encoder feeds a decoder that maps the embedded data back into the high-dimensional space. A deep neural network, trained to separate speech from non-speech frames, is obtained by concatenating the decoder to the encoder, resembling the known diffusion nets architecture. Experimental results show improved performance over competing voice activity detection methods in accuracy, robustness, and generalization ability. Our model operates in real time and can be integrated into audio-based communication systems. We also present a batch algorithm that achieves even higher accuracy for offline applications.
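
To illustrate the kind of low-dimensional representation the encoder is trained to reproduce, the following is a minimal sketch of a generic diffusion maps embedding. It is not the authors' implementation: the Gaussian kernel, the median-based scale `epsilon`, the diffusion time `t`, and the function name `diffusion_map` are all assumptions made for illustration, and the feature extraction and network training described above are not shown.

```python
import numpy as np

def diffusion_map(X, n_components=2, epsilon=None, t=1):
    """Basic diffusion maps embedding of the rows of X (illustrative sketch)."""
    # Pairwise squared Euclidean distances between feature vectors.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    if epsilon is None:
        epsilon = np.median(d2)  # heuristic kernel scale (assumption)
    K = np.exp(-d2 / epsilon)  # Gaussian affinity kernel (assumption)

    # S = D^{-1/2} K D^{-1/2} is symmetric and shares its eigenvalues
    # with the row-stochastic Markov matrix P = D^{-1} K.
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]

    # Right eigenvectors of P are D^{-1/2} times the eigenvectors of S.
    psi = vecs / np.sqrt(d)[:, None]

    # Drop the trivial constant eigenvector (eigenvalue 1) and scale the
    # remaining coordinates by their eigenvalues raised to the power t.
    idx = slice(1, n_components + 1)
    return psi[:, idx] * vals[idx] ** t, vals
```

In a diffusion-nets-style setup, embeddings like these would serve as regression targets for the encoder, so that new frames can be embedded with a forward pass instead of a fresh eigendecomposition.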
