Statistical and Neural Network Based Speech Activity Detection in Non-Stationary Acoustic Environments

Speech activity detection (SAD) often rests on the assumption that noise is "more" stationary than speech; it is therefore particularly challenging in non-stationary environments, where the time variance of the acoustic scene makes it difficult to discriminate speech from noise. We propose two approaches to SAD, one based on statistical signal processing and the other on neural networks. The former employs signal processing techniques to track the noise and speech energies, making the case for a resource-efficient, unsupervised approach. The latter introduces a recurrent network layer that operates on short segments of the input speech to perform temporal smoothing in the presence of non-stationary noise. Both systems are tested on the Fearless Steps challenge, which consists of transmission data from the Apollo-11 space mission. The statistical SAD achieves detection performance comparable to previously proposed neural network based SADs, while the neural network based approach achieves a decision cost function of 1.07% on the evaluation set of the 2020 Fearless Steps Challenge, setting a new state of the art.
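To illustrate the flavor of an unsupervised, energy-tracking statistical SAD of the kind the abstract alludes to, the sketch below frames a signal, recursively smooths a noise-floor estimate during detected pauses, and flags frames whose energy exceeds a multiple of that floor. All function names, constants, and the simple fixed-smoothing decision-directed heuristic are illustrative assumptions, not the authors' actual algorithm.

```python
def frame_energies(samples, frame_len=160):
    """Mean-square energy of each non-overlapping frame (e.g. 20 ms at 8 kHz)."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def energy_sad(energies, alpha=0.95, ratio=3.0):
    """Flag a frame as speech when its energy exceeds `ratio` times the
    recursively smoothed noise-floor estimate. The floor is updated only
    during non-speech frames, a common decision-directed heuristic."""
    noise = energies[0] if energies else 0.0  # bootstrap floor from first frame
    decisions = []
    for e in energies:
        is_speech = e > ratio * noise
        if not is_speech:
            # First-order recursive smoothing tracks the noise during pauses.
            noise = alpha * noise + (1.0 - alpha) * e
        decisions.append(is_speech)
    return decisions
```

For example, with per-frame energies `[1.0, 1.1, 0.9, 10.0, 12.0, 1.0]` the detector marks only the two high-energy frames as speech, since the noise floor stays near 1.0 while it is updated exclusively in the low-energy frames. A real system would add hangover smoothing and a likelihood-ratio test over spectral bins rather than a single broadband energy.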
