Deep Belief Networks Based Voice Activity Detection

Fusing the advantages of multiple acoustic features is important for the robustness of voice activity detection (VAD). Recently, machine-learning-based VADs have been shown to outperform traditional VADs on multiple-feature fusion tasks. However, existing machine-learning-based VADs use only shallow models, which cannot explore the underlying manifold of the features. In this paper, we propose to fuse multiple features via a deep model, the deep belief network (DBN). A DBN is a powerful hierarchical generative model for feature extraction: it can describe highly variant functions and discover the manifold of the features. We take the serially concatenated features as the input layer of the DBN and extract a new feature by passing them through multiple nonlinear hidden layers. Finally, we predict the class of the new feature with a linear classifier. We further show by analysis that even a belief network with a single hidden layer is as powerful as the state-of-the-art models used in machine-learning-based VADs. In our empirical comparison, ten common features are used for performance analysis. Extensive experimental results on the AURORA2 corpus show that the DBN-based VAD not only outperforms eleven reference VADs but also meets the real-time detection demand of VAD. The results also show that the DBN-based VAD can fuse the advantages of multiple features effectively.
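The detection pipeline described above (serial concatenation of the features, several nonlinear hidden layers, then a linear classifier) can be sketched roughly as follows. This is only an illustrative forward pass under stated assumptions, not the paper's implementation: the sigmoid nonlinearity, the layer sizes, the feature dimensions, the zero decision threshold, and all function names are hypothetical, and the weights are random placeholders standing in for parameters that a trained DBN would supply.

```python
import math
import random


def sigmoid(x):
    """Logistic nonlinearity (assumed here; the abstract only says 'nonlinear')."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(weights, biases, inputs):
    """One hidden layer: affine map followed by the sigmoid, unit by unit."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def dbn_vad_forward(feature_groups, hidden_layers, out_weights, out_bias):
    """Return (new_feature, score, label) for a single frame."""
    # Serially concatenate the per-frame acoustic features into the input layer.
    h = [value for group in feature_groups for value in group]
    # Pass the input through the stack of nonlinear hidden layers.
    for weights, biases in hidden_layers:
        h = layer(weights, biases, h)
    # Linear classifier on the extracted feature; threshold 0 is an assumption.
    score = sum(w * x for w, x in zip(out_weights, h)) + out_bias
    return h, score, 1 if score > 0 else 0


def random_layer(n_out, n_in, rng):
    """Placeholder parameters; a real system would load trained DBN weights."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
# Three hypothetical acoustic features of dimensions 3, 5 and 2 for one frame.
feature_groups = [[rng.gauss(0.0, 1.0) for _ in range(n)] for n in (3, 5, 2)]
hidden_layers = [random_layer(8, 10, rng), random_layer(4, 8, rng)]
out_weights = [rng.gauss(0.0, 1.0) for _ in range(4)]

new_feature, score, label = dbn_vad_forward(
    feature_groups, hidden_layers, out_weights, 0.0
)
print(len(new_feature), label)
```

Here `label` is 1 for speech and 0 for non-speech under the assumed threshold. Once the parameters are fixed, each frame costs only a few small matrix-vector products.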
