Paper Information - AUC Optimization for Deep Learning Based Voice Activity Detection

Voice activity detection (VAD) based on deep neural networks (DNNs) has demonstrated good performance in adverse acoustic environments. Current DNN-based VAD optimizes a surrogate function, e.g., minimum cross-entropy or minimum squared error, at a given decision threshold. However, VAD usually works on the fly with a dynamic decision threshold, and the receiver operating characteristic (ROC) curve is a global evaluation metric that reflects VAD performance at all possible decision thresholds. In this paper, we propose to optimize the area under the ROC curve (AUC) with a DNN, which maximizes VAD performance in terms of the ROC curve. Experimental results show that optimizing AUC with a DNN yields higher performance than the common method of minimizing the squared error with a DNN.
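To make the threshold-free objective concrete, the following minimal sketch computes the empirical AUC as the fraction of correctly ranked (speech, non-speech) frame pairs, together with a pairwise hinge surrogate that could serve as a training loss. The abstract does not state the exact loss used in the paper, so the hinge form, the function names `empirical_auc` and `pairwise_hinge_auc_loss`, the `margin` parameter, and the toy scores are all assumptions made for illustration.

```python
import numpy as np


def empirical_auc(scores, labels):
    """Empirical AUC: share of (speech, non-speech) pairs ranked correctly.

    Ties between a speech score and a non-speech score count as 0.5.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]  # scores of speech frames
    neg = scores[labels == 0]  # scores of non-speech frames
    diff = pos[:, None] - neg[None, :]  # all pairwise score differences
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def pairwise_hinge_auc_loss(scores, labels, margin=1.0):
    """Hypothetical surrogate: penalize pairs whose score gap is below `margin`."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.maximum(0.0, margin - diff).mean())


# Toy frame-level scores (e.g., DNN outputs) and labels; 1 = speech, 0 = non-speech.
scores = np.array([0.9, 0.4, 0.5, 0.6, 0.2])
labels = np.array([1, 1, 0, 1, 0])

print(empirical_auc(scores, labels))            # 5 of the 6 pairs are ranked correctly
print(pairwise_hinge_auc_loss(scores, labels))  # positive while any gap is below the margin
```

Neither quantity depends on a decision threshold, which is the property the abstract argues for. The surrogate is needed because the pair-counting AUC itself is piecewise constant in the scores and so gives no useful gradient for training.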

Susanto Rahardja | Xiao-Lei Zhang | Jingdong Chen | Zhongxin Bai | Zi-Chen Fan
