Robust Voice Activity Detection Based on Complementary BLSTM Enhancement Stage

In this paper, we propose a new two-stage deep structure with a joint learning technique to improve Voice Activity Detection (VAD) under varied noise conditions, especially for noise types not seen during training. The first stage enhances the noisy signal and is complementary to the second stage. It uses a Bidirectional Long Short-Term Memory (BLSTM) architecture so that it can exploit both previous and upcoming frames. The second stage uses features of the enhanced frames to predict the speech presence probability. Following previous studies, we use Multi-Resolution Cochleagram (MRCG) features to achieve higher robustness. We evaluate the proposed method on the TIMIT corpus using the Area Under the Curve (AUC) and precision metrics. In our evaluations, the proposed method outperforms state-of-the-art baseline methods based on deep structures in both AUC and precision. The AUC improvement over the other methods is significant for noises not seen in the training step.
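As an illustration of the two evaluation metrics named above, the following minimal Python sketch computes AUC and precision from frame-level speech presence probabilities. It is not the authors' evaluation code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for this example.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for speech frames, 0 for non-speech frames.
    scores: predicted speech presence probability per frame.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one speech and one non-speech frame")
    # Count (speech, non-speech) pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision(labels, scores, threshold=0.5):
    """Fraction of frames predicted as speech that are truly speech.

    The 0.5 threshold is an assumption for this sketch, not a value
    taken from the paper.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if p and y == 0)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical toy data: eight frames with made-up labels and scores.
labels = [0, 0, 1, 1, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.6]

print(roc_auc(labels, scores))    # 0.875
print(precision(labels, scores))  # 0.75
```

AUC summarizes ranking quality over all thresholds, whereas precision depends on the chosen operating point, which is one reason to report both.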
