论文信息 - Deep Belief Networks Based Voice Activity Detection

Deep Belief Networks Based Voice Activity Detection

Fusing the advantages of multiple acoustic features is important for the robustness of voice activity detection (VAD). Recently, the machine-learning-based VADs have shown a superiority to traditional VADs on multiple feature fusion tasks. However, existing machine-learning-based VADs only utilize shallow models, which cannot explore the underlying manifold of the features. In this paper, we propose to fuse multiple features via a deep model, called deep belief network (DBN). DBN is a powerful hierarchical generative model for feature extraction. It can describe highly variant functions and discover the manifold of the features. We take the multiple serially-concatenated features as the input layer of DBN, and then extract a new feature by transferring these features through multiple nonlinear hidden layers. Finally, we predict the class of the new feature by a linear classifier. We further analyze that even a single-hidden-layer-based belief network is as powerful as the state-of-the-art models in the machine-learning-based VADs. In our empirical comparison, ten common features are used for performance analysis. Extensive experimental results on the AURORA2 corpus show that the DBN-based VAD not only outperforms eleven referenced VADs, but also can meet the real-time detection demand of VAD. The results also show that the DBN-based VAD can fuse the advantages of multiple features effectively.

Xiao-Lei Zhang | Ji Wu | Xiao-Lei Zhang | Ji Wu

[1] DeLiang Wang,et al. Cocktail Party Processing via Structured Prediction , 2012, NIPS.

[2] David Pearce,et al. The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions , 2000, INTERSPEECH.

[3] Yoshua Bengio,et al. Deep Learning of Representations for Unsupervised and Transfer Learning , 2011, ICML Unsupervised and Transfer Learning.

[4] Peter Glöckner,et al. Why Does Unsupervised Pre-training Help Deep Learning? , 2013 .

[5] Ji Wu,et al. Linearithmic Time Sparse and Convex Maximum Margin Clustering , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[6] Kilian Q. Weinberger,et al. Marginalized Denoising Autoencoders for Domain Adaptation , 2012, ICML.

[7] Javier Ramírez,et al. Efficient voice activity detection algorithms using long-term speech information , 2004, Speech Commun..

[8] John H. L. Hansen,et al. Discriminative Training for Multiple Observation Likelihood Ratio Based Voice Activity Detection , 2010, IEEE Signal Processing Letters.

[9] Dong Yu,et al. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks , 2012, ICML.

[10] Brian Kingsbury,et al. Domain Adaptation in Machine Learning and Speech Processing , 2012 .

[11] Guy J. Brown,et al. A multi-pitch tracking algorithm for noisy speech , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[12] DeLiang Wang,et al. A Tandem Algorithm for Singing Pitch Extraction and Voice Separation From Music Accompaniment , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[13] DeLiang Wang,et al. Locally excitatory globally inhibitory oscillator networks , 1995, IEEE Transactions on Neural Networks.

[14] Dong Yu,et al. Deep-structured hidden conditional random fields for phonetic recognition , 2010, INTERSPEECH.

[15] Wei Zhang,et al. A soft voice activity detector based on a Laplacian-Gaussian model , 2003, IEEE Trans. Speech Audio Process..

[16] Joon-Hyuk Chang,et al. Statistical model-based voice activity detection using support vector machine , 2009 .

[17] Geoffrey E. Hinton,et al. Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[18] Geoffrey E. Hinton,et al. Acoustic Modeling Using Deep Belief Networks , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[19] Dong Enqing,et al. Applying support vector machines to voice activity detection , 2002, 6th International Conference on Signal Processing, 2002..

[20] Ji Wu,et al. An efficient voice activity detection algorithm by combining statistical model and energy detection , 2011, EURASIP J. Adv. Signal Process..

[21] DeLiang Wang,et al. Monaural speech segregation based on pitch tracking and amplitude modulation , 2002, IEEE Transactions on Neural Networks.

[22] Thorsten Joachims,et al. Sparse kernel SVMs via cutting-plane training , 2009, Machine-mediated learning.

[23] Andrew Y. Ng,et al. Selecting Receptive Fields in Deep Networks , 2011, NIPS.

[24] DeLiang Wang,et al. Unvoiced Speech Segregation From Nonspeech Interference via CASA and Spectral Subtraction , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[25] Geoffrey E. Hinton. A Practical Guide to Training Restricted Boltzmann Machines , 2012, Neural Networks: Tricks of the Trade.

[26] Jianwu Dang,et al. Voice Activity Detection Based on an Unsupervised Learning Framework , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[27] DeLiang Wang,et al. HMM-Based Multipitch Tracking for Noisy and Reverberant Speech , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[28] Ramjee Prasad,et al. Convex Combination of Multiple Statistical Models With Application to VAD , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[29] Yoshua Bengio,et al. Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[30] Jason Weston,et al. A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[31] Miguel Á. Carreira-Perpiñán,et al. On Contrastive Divergence Learning , 2005, AISTATS.

[32] Yoshua Bengio,et al. Greedy Layer-Wise Training of Deep Networks , 2006, NIPS.

[33] DeLiang Wang,et al. Towards Generalizing Classification Based Speech Separation , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[34] Albert S. Bregman,et al. The Auditory Scene. (Book Reviews: Auditory Scene Analysis. The Perceptual Organization of Sound.) , 1990 .

[35] Marc'Aurelio Ranzato,et al. Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[36] Yang Lu,et al. An algorithm that improves speech intelligibility in noise for normal-hearing listeners. , 2009, The Journal of the Acoustical Society of America.

[37] Juan Manuel Górriz,et al. Improved Voice Activity Detection Using Contextual Multiple Hypothesis Testing for Robust Speech Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[38] DeLiang Wang,et al. Reverberant Speech Segregation Based on Multipitch Tracking and Classification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[39] E. Shlomot,et al. ITU-T Recommendation G.729 Annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications , 1997, IEEE Commun. Mag..

[40] Wei Li,et al. A new VAD framework using statistical model and human knowledge based empirical rule , 2010, INTERSPEECH.

[41] Sadegh Rezaei,et al. A Soft Voice Activity Detection Using GARCH Filter and Variance Gamma Distribution , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[42] Birger Kollmeier,et al. SNR estimation based on amplitude modulation analysis with applications to noise suppression , 2003, IEEE Trans. Speech Audio Process..

[43] DeLiang Wang,et al. An Unsupervised Approach to Cochannel Speech Separation , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[44] Wonyong Sung,et al. A statistical model-based voice activity detection , 1999, IEEE Signal Processing Letters.

[45] DeLiang Wang,et al. A Tandem Algorithm for Pitch Estimation and Voiced Speech Segregation , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[46] Juan Manuel Górriz,et al. SVM-based speech endpoint detection using contextual speech features , 2006 .

[47] D. Wang,et al. The time dimension for scene analysis , 2005, IEEE Transactions on Neural Networks.

[48] Ji Wu,et al. Efficient Multiple Kernel Support Vector Machine Based Voice Activity Detection , 2011, IEEE Signal Processing Letters.

[49] Yee Whye Teh,et al. A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[50] Zenglin Xu,et al. Simple and Efficient Multiple Kernel Learning by Group Lasso , 2010, ICML.

[51] Guy J. Brown,et al. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications , 2006 .

[52] Xuejing Sun,et al. Pitch determination and voice quality analysis using Subharmonic-to-Harmonic Ratio , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[53] Ji Wu,et al. Maximum Margin Clustering Based Statistical VAD With Multiple Observation Compound Feature , 2011, IEEE Signal Processing Letters.

[54] Sanjit K. Mitra,et al. Voice activity detection based on multiple statistical models , 2006, IEEE Transactions on Signal Processing.

[55] Dong Yu,et al. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[56] Javier Ramírez,et al. Statistical voice activity detection using a multiple observation likelihood ratio test , 2005, IEEE Signal Processing Letters.

[57] Dong Yu,et al. Deep Learning and Its Applications to Signal and Information Processing , 2011 .

[58] Li Deng,et al. Learning in the Deep-Structured Conditional Random Fields , 2009 .

[59] Tara N. Sainath,et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[60] Qiang Yang,et al. A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[61] DeLiang Wang,et al. A Supervised Learning Approach to Monaural Segregation of Reverberant Speech , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[62] Chih-Jen Lin,et al. A Practical Guide to Support Vector Classication , 2008 .

[63] DeLiang Wang,et al. Exploring Monaural Features for Classification-Based Speech Segregation , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[64] Tatsuya Kawahara,et al. Online Unsupervised Classification With Model Comparison in the Variational Bayes Framework for Voice Activity Detection , 2010, IEEE Journal of Selected Topics in Signal Processing.

[65] Dong Yu,et al. Investigation of full-sequence training of deep belief networks for speech recognition , 2010, INTERSPEECH.

[66] Nir Friedman,et al. Probabilistic Graphical Models - Principles and Techniques , 2009 .

[67] G. Kramer. Auditory Scene Analysis: The Perceptual Organization of Sound by Albert Bregman (review) , 2016 .

[68] Joon-Hyuk Chang,et al. Voice activity detection based on statistical models and machine learning approaches , 2010, Comput. Speech Lang..

[69] Sang-Ick Kang,et al. Discriminative Weight Training for a Statistical Model-Based Voice Activity Detection , 2008, IEEE Signal Processing Letters.

[70] Dong Yu,et al. Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP] , 2011, IEEE Signal Processing Magazine.

[71] Hoirin Kim,et al. Multiple Acoustic Model-Based Discriminative Likelihood Ratio Weighting for Voice Activity Detection , 2012, IEEE Signal Processing Letters.

[72] Yoshua. Bengio,et al. Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..