Convolutional Recurrent Neural Networks for Speech Activity Detection in Naturalistic Audio from Apollo Missions

Speech Activity Detection (SAD) aims to correctly identify the audio segments that contain human speech. Several solutions have been successfully applied to the SAD task, with deep learning approaches being especially relevant nowadays. This paper describes a SAD solution based on Convolutional Recurrent Neural Networks (CRNN), presented as the ViVoLab submission to the 2020 Fearless Steps Challenge. The dataset comes from the audio of the Apollo space missions, a challenging domain with strong degradation and several kinds of transmission noise. First, we explore the performance of 1D and 2D convolutional processing stages. Then we propose a novel architecture that fuses two convolutional feature maps, combining the information captured by 1D and 2D filters. The results largely outperform the baseline provided by the challenge organisers: all configurations achieve a detection cost function (DCF) below 2% on the development set. The best results were obtained with the proposed fusion architecture, which reaches a DCF of 1.78% on the evaluation set and ranks fourth among all participating teams in the challenge SAD task.
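To make the reported metric concrete, the following is a minimal sketch of a detection cost function computed over frame-level labels. It assumes the weighting used in the Fearless Steps Challenge definition, DCF = 0.75 · P_FN + 0.25 · P_FP, which the abstract itself does not state. The function name `detection_cost`, the 0/1 label encoding, and the frame-based counting are illustrative assumptions; the official scoring is time-based and may differ in detail.

```python
from typing import Sequence


def detection_cost(reference: Sequence[int], hypothesis: Sequence[int],
                   w_fn: float = 0.75, w_fp: float = 0.25) -> float:
    """Weighted sum of miss rate and false-alarm rate over frames.

    Labels are assumed to be 1 for speech and 0 for non-speech.
    The default weights follow the assumed challenge definition.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")

    speech = sum(1 for r in reference if r == 1)
    nonspeech = len(reference) - speech

    # Speech frames labelled as non-speech (misses).
    missed = sum(1 for r, h in zip(reference, hypothesis) if r == 1 and h == 0)
    # Non-speech frames labelled as speech (false alarms).
    false_alarms = sum(1 for r, h in zip(reference, hypothesis) if r == 0 and h == 1)

    p_fn = missed / speech if speech else 0.0
    p_fp = false_alarms / nonspeech if nonspeech else 0.0
    return w_fn * p_fn + w_fp * p_fp


# Toy example: one of four speech frames missed, one of four non-speech frames falsely detected.
reference = [1, 1, 1, 1, 0, 0, 0, 0]
hypothesis = [1, 1, 1, 0, 1, 0, 0, 0]
print(f"DCF = {100 * detection_cost(reference, hypothesis):.2f}%")
```

Because misses are weighted three times as heavily as false alarms under this assumption, a system tuned for this metric would tend to favour a lower decision threshold for speech.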
