Speech Activity Detection in Naturalistic Audio Environments: Fearless Steps Apollo Corpus

Speech activity detection (SAD) is a fundamental building block for most spoken language technology systems. Developing efficient SAD systems for highly naturalistic data remains a challenge. In this study, we investigate the SAD problem on NASA's Apollo space mission data <xref ref-type="bibr" rid="ref1">[1]</xref>. The Apollo data consists of long-term naturalistic audio recordings (i.e., 6–12 day missions) and poses several challenges: 1) noise distortion with variable SNR, 2) channel distortion, 3) very high speech density, 4) foreground versus background speech, and 5) extended periods of nonspeech activity. We use a threshold-optimized Combo-SAD <xref ref-type="bibr" rid="ref21">[21]</xref> as our baseline unsupervised system; this technique was developed to address variable speech/nonspeech density in long-term audio data. To mitigate issues related to Apollo audio loops, multispeaker scenarios (including foreground versus background conversations within loops), and highly noisy backgrounds, we develop a new curriculum learning (CL) based convolutional neural network (CNN) model. This method leverages the long-term modeling capability of CNNs together with a CL strategy, in which training data are presented in an order that improves learning efficiency; here, the signal-to-noise ratio serves as the curriculum difficulty measure. Our experiments on free-flowing Apollo audio data show that the proposed approach provides a significant improvement in SAD performance (<inline-formula><tex-math notation="LaTeX">$> 10\%$</tex-math></inline-formula>).
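The SNR-based curriculum idea described above (present cleaner, high-SNR training utterances before noisier, low-SNR ones) can be sketched as follows. This is a minimal illustration under stated assumptions: the frame-energy SNR proxy and the function names are hypothetical, not the authors' actual estimator or implementation.

```python
import numpy as np

def estimate_snr_db(signal, frame_len=400):
    """Crude frame-energy SNR proxy: treat the quietest frames as the
    noise floor and the loudest as the speech level. Illustrative only;
    the paper's actual SNR estimation method is not specified here."""
    n_frames = max(1, len(signal) // frame_len)
    frames = np.array_split(np.asarray(signal, dtype=float), n_frames)
    energies = np.array([np.mean(f ** 2) + 1e-12 for f in frames])
    noise_floor = np.percentile(energies, 10)
    speech_level = np.percentile(energies, 90)
    return 10.0 * np.log10(speech_level / noise_floor)

def curriculum_order(utterances):
    """Order training utterances easy-to-hard for curriculum learning:
    high-SNR (cleaner) utterances first, low-SNR (noisier) ones last."""
    return sorted(utterances, key=estimate_snr_db, reverse=True)
```

In practice, the ordered utterances would be fed to the CNN in stages (clean subsets first, progressively adding noisier material) rather than in a single sorted pass.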

[1] Jianwu Dang et al., "Phase aware deep neural network for noise robust voice activity detection," 2017 IEEE International Conference on Multimedia and Expo (ICME).

[2] George Saon et al., "Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] John H. L. Hansen et al., "A Speaker Diarization System for Studying Peer-Led Team Learning Groups," INTERSPEECH, 2016.

[4] Daniel Povey et al., "The Kaldi Speech Recognition Toolkit," 2011.

[5] John H. L. Hansen et al., "Multi-Channel Apollo Mission Speech Transcripts Calibration," INTERSPEECH, 2017.

[6] John H. L. Hansen et al., "Unsupervised Speech Activity Detection Using Voicing Measures and Perceptual Spectral Flux," IEEE Signal Processing Letters, 2013.

[7] John H. L. Hansen et al., "Prof-Life-Log: Personal interaction analysis for naturalistic audio streams," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] John H. L. Hansen et al., "'Houston, we have a solution': Using NASA Apollo program to advance speech and language processing technology," INTERSPEECH, 2013.

[9] John H. L. Hansen et al., "Sentiment extraction from natural audio streams," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10] Wei-Ping Zhu et al., "Design and Performance Analysis of Bayesian, Neyman–Pearson, and Competitive Neyman–Pearson Voice Activity Detectors," IEEE Transactions on Signal Processing, 2007.

[11] Sanjeev Khudanpur et al., "A pitch extraction algorithm tuned for automatic speech recognition," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Xiao-Lei Zhang et al., "Deep Belief Networks Based Voice Activity Detection," IEEE Transactions on Audio, Speech, and Language Processing, 2013.

[13] Sven Nordholm et al., "Statistical Voice Activity Detection Using Low-Variance Spectrum Estimation and an Adaptive Threshold," IEEE Transactions on Audio, Speech, and Language Processing, 2006.

[14] Brian Kingsbury et al., "Improvements to the IBM speech activity detection system for the DARPA RATS program," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Yajie Miao et al., "Kaldi+PDNN: Building DNN-based ASR Systems with Kaldi and PDNN," arXiv, 2014.

[16] Jason Weston et al., "Curriculum learning," ICML, 2009.

[17] John H. L. Hansen et al., "Speech activity detection for NASA Apollo space missions: challenges and solutions," INTERSPEECH, 2014.

[18] Javier Ramírez et al., "Efficient voice activity detection algorithms using long-term speech information," Speech Communication, 2004.

[19] Thad Hughes et al., "Recurrent neural networks for voice activity detection," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20] Jean-Luc Gauvain et al., "Minimum word error training of RNN-based voice activity detection," INTERSPEECH, 2015.

[21] John H. L. Hansen et al., "Prof-Life-Log: Analysis and classification of activities in daily audio streams," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22] John H. L. Hansen et al., "Curriculum Learning Based Probabilistic Linear Discriminant Analysis for Noise Robust Speaker Recognition," INTERSPEECH, 2017.

[23] Shih-Chii Liu et al., "A curriculum learning method for improved noise robustness in automatic speech recognition," 2017 25th European Signal Processing Conference (EUSIPCO).

[24] Robert Hooke et al., "'Direct Search' Solution of Numerical and Statistical Problems," Journal of the ACM, 1961.

[25] John H. L. Hansen et al., "Automatic sentiment extraction from YouTube videos," 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[26] Björn W. Schuller et al., "Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27] Mark Liberman et al., "Speech activity detection on YouTube using deep neural networks," INTERSPEECH, 2013.

[28] Wonyong Sung et al., "A statistical model-based voice activity detection," IEEE Signal Processing Letters, 1999.

[29] John H. L. Hansen et al., "Curriculum Learning Based Approaches for Noise Robust Speaker Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.

[30] Yasunari Obuchi, "Framewise speech-nonspeech classification by neural networks for voice activity detection with statistical noise suppression," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).