Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection

Speech Activity Detection (SAD) plays an important role in mobile communications and automatic speech recognition (ASR). Developing efficient SAD systems for real-world applications is challenging due to the presence of noise. We propose a new approach to SAD that treats it as a two-dimensional multi-label image classification problem. To classify the audio segments, we compute their Short-time Fourier Transform (STFT) spectrograms and classify them with a Convolutional Recurrent Neural Network (CRNN), an architecture traditionally used in image recognition. Our CRNN uses a sigmoid activation function, max-pooling in the frequency domain, and a convolutional operation as a moving-average filter to remove misclassified spikes. On the development set of Task 1 of the 2019 Fearless Steps Challenge, our system achieved a decision cost function (DCF) of 2.89%, a 66.4% improvement over the baseline. It also achieved a DCF of 3.318% on the challenge's evaluation set, ranking first among all submissions.
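The pipeline described above lends itself to a short sketch: 2-D convolutions over a log-magnitude STFT spectrogram, max-pooling along the frequency axis, a recurrent layer over the time axis, a per-frame sigmoid output, and a moving-average filter (a 1-D convolution with a uniform kernel) to smooth out isolated misclassified frames. The PyTorch code below is a minimal illustration under those assumptions; the layer counts, kernel sizes, smoothing window, and names such as `CRNNSad` and `smooth_posteriors` are hypothetical and do not reproduce the paper's exact configuration.

```python
# Hypothetical sketch of a CRNN for speech activity detection over STFT
# spectrograms: 2-D convolutions, frequency-axis max-pooling, a GRU over
# time, per-frame sigmoid posteriors, and moving-average smoothing.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CRNNSad(nn.Module):
    def __init__(self, n_freq_bins=257, conv_channels=32, rnn_hidden=64):
        super().__init__()
        # 2-D convolutions over the (frequency, time) plane
        self.conv1 = nn.Conv2d(1, conv_channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(conv_channels, conv_channels, kernel_size=3, padding=1)
        # Pool along the frequency axis only, preserving time resolution
        self.freq_pool = nn.MaxPool2d(kernel_size=(4, 1))
        rnn_input = conv_channels * (n_freq_bins // 4 // 4)
        self.rnn = nn.GRU(rnn_input, rnn_hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * rnn_hidden, 1)

    def forward(self, spec):
        # spec: (batch, 1, freq_bins, time_frames) log-magnitude STFT
        x = self.freq_pool(F.relu(self.conv1(spec)))
        x = self.freq_pool(F.relu(self.conv2(x)))
        # (batch, channels, freq, time) -> (batch, time, channels * freq)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        x, _ = self.rnn(x)
        # Per-frame speech posterior in [0, 1]
        return torch.sigmoid(self.classifier(x)).squeeze(-1)


def smooth_posteriors(posteriors, window=11):
    # Moving-average filter implemented as a 1-D convolution with a uniform
    # kernel; suppresses isolated misclassified spikes before thresholding.
    kernel = torch.full((1, 1, window), 1.0 / window)
    padded = posteriors.unsqueeze(1)  # (batch, 1, time)
    return F.conv1d(padded, kernel, padding=window // 2).squeeze(1)
```

In a setup like this, the smoothed per-frame posteriors would typically be thresholded (e.g. at 0.5) to produce speech/non-speech segments before scoring with the DCF metric.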
