A Deep Neural Network approach for Voice Activity Detection in multi-room domestic scenarios

This paper presents a Voice Activity Detector (VAD) for multi-room domestic scenarios. A multi-room VAD (mVAD) simultaneously detects the time boundaries of a speech segment and determines the room in which it was generated. The proposed approach is fully data-driven and is based on a Deep Neural Network (DNN) pre-trained as a Deep Belief Network (DBN) and fine-tuned by standard error back-propagation. Six different types of feature sets are extracted from multiple microphone signals and combined to perform the classification. The proposed DBN-DNN multi-room VAD (referred to simply as DBN-mVAD) is compared with two other NN-based mVADs: a Multi-Layer Perceptron (MLP-mVAD) and a Bidirectional Long Short-Term Memory recurrent neural network (BLSTM-mVAD). A large multi-microphone dataset, recorded in a home, is used to assess performance through a multi-stage analysis strategy in which feature selection stages alternate with network size and input microphone selection stages. The proposed approach notably outperforms the alternative algorithms in the first feature selection stage and in the network selection stage: in terms of area under the precision-recall curve (AUC), the absolute increment is 5.55% with respect to the BLSTM-mVAD and 2.65% with respect to the MLP-mVAD. Hence, only the proposed approach undergoes the remaining selection stages. Through these stages, the DBN-mVAD achieves significant improvements: the absolute increments in AUC and F-measure are 10.41% and 8.56%, respectively, with respect to the first stage of DBN-mVAD.
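
The two figures of merit named above, F-measure and area under the precision-recall curve, can be illustrated with a minimal sketch. It assumes frame-level binary labels (1 = speech in the target room) and one network output score per frame; the function names, the trapezoidal integration, and the toy data are illustrative assumptions, not the evaluation protocol of the paper.

```python
def precision_recall_f(scores, labels, threshold):
    """Frame-level precision, recall and F-measure at one decision threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    # Convention (assumption): precision is 1 when no frame is flagged as speech.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def pr_auc(scores, labels):
    """Area under the precision-recall curve by trapezoidal integration."""
    # Sweep the threshold over every distinct score, from strictest to loosest.
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 1.0)]  # (recall, precision) starting point, by convention
    for t in thresholds:
        p, r, _ = precision_recall_f(scores, labels, t)
        points.append((r, p))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Toy example: six frames, three of which contain speech.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
p, r, f = precision_recall_f(toy_scores, toy_labels, threshold=0.5)
auc = pr_auc(toy_scores, toy_labels)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f} AUC={auc:.3f}")
```

Under this reading, an "absolute increment" between two systems would be the plain difference of their AUC (or F-measure) values expressed in percentage points, rather than a relative change.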
