Boosted deep neural networks and multi-resolution cochleagram features for voice activity detection

Voice activity detection (VAD) is an important front end of many speech processing systems. In this paper, we describe a new VAD algorithm based on boosted deep neural networks (bDNNs). The proposed algorithm first generates multiple base predictions for a single frame from only one DNN and then aggregates these base predictions into a better prediction for that frame. Moreover, we employ a new acoustic feature, the multi-resolution cochleagram (MRCG), which concatenates cochleagram features at multiple spectrotemporal resolutions and has shown superior speech separation results compared with many other acoustic features. Experimental results show that bDNN-based VAD with the MRCG feature outperforms state-of-the-art VADs by a considerable margin.
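To make the boosting idea concrete, the following is a minimal sketch of the aggregation step described above, assuming a context half-width W, one base prediction per frame in each window, and simple averaging as the aggregation rule; the window size, threshold, and all names are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

# Sketch of bDNN-style aggregation (assumptions: half-width W, mean aggregation).
# A single DNN, fed the frames t-W..t+W around frame t, emits one base
# prediction per frame in that window; the final score for a frame averages
# the base predictions contributed by every window that covers it.

def aggregate_base_predictions(window_preds, W):
    """Combine per-window base predictions into one score per frame.

    window_preds: array of shape (T, 2*W + 1); window_preds[t, k] is the
        DNN's prediction for frame t - W + k, produced by the window
        centred on frame t.
    Returns an array of length T with one aggregated score per frame.
    """
    T = window_preds.shape[0]
    scores = np.zeros(T)
    counts = np.zeros(T)
    for t in range(T):
        for k in range(2 * W + 1):
            u = t - W + k            # frame this base prediction refers to
            if 0 <= u < T:
                scores[u] += window_preds[t, k]
                counts[u] += 1
    return scores / counts           # every frame is covered by its own window


if __name__ == "__main__":
    # Toy usage: 100 frames, half-width 5, random stand-in for DNN outputs.
    T, W = 100, 5
    rng = np.random.default_rng(0)
    window_preds = rng.uniform(-1.0, 1.0, size=(T, 2 * W + 1))
    frame_scores = aggregate_base_predictions(window_preds, W)
    vad_decisions = frame_scores > 0.0   # decision threshold is an assumption
    print(vad_decisions[:10])
```

The MRCG feature referenced above is formed analogously by computing cochleagram features at several spectrotemporal resolutions and concatenating them frame by frame before they are fed to the DNN.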
