AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies

Speech activity detection (or endpointing) is an important processing step for applications such as speech recognition, language identification, and speaker diarization. Both audio- and vision-based approaches have been used for this task in various settings, often tailored toward end applications. However, much of the prior work reports results on synthetic settings, task-specific datasets, or datasets that are not openly available, which makes it difficult to compare approaches and understand their strengths and weaknesses. In this paper, we describe AVA-Speech, a new dataset of densely labeled speech activity in YouTube videos that we will release publicly, with the goal of creating a shared, freely available benchmark for this task. The labels annotate three speech activity conditions: clean speech, speech co-occurring with music, and speech co-occurring with noise, which enable analysis of model performance in the more challenging conditions created by overlapping background audio. We report benchmark performance numbers on AVA-Speech using off-the-shelf, state-of-the-art audio and vision models, which serve as baselines to facilitate future research.
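To make the dense-labeling setup concrete, here is a minimal Python sketch of how per-condition benchmark numbers could be computed from segment-level annotations. The CSV schema (video_id, start, end, label), the label names, the 10 ms frame granularity, and the `detector` callable are all illustrative assumptions for this sketch, not the released dataset format or the paper's evaluation code.

```python
# Minimal sketch: per-condition evaluation of a frame-level speech activity
# detector against dense segment labels. Schema and label names are assumed
# for illustration; consult the released dataset for the actual format.
import csv
from collections import defaultdict

FRAME_SEC = 0.010  # 10 ms evaluation frames (assumed granularity)
SPEECH_LABELS = {"CLEAN_SPEECH", "SPEECH_WITH_MUSIC", "SPEECH_WITH_NOISE"}

def load_segments(path):
    """Read densely labeled segments: one (video, start, end, label) per row."""
    segments = defaultdict(list)
    with open(path, newline="") as f:
        for video_id, start, end, label in csv.reader(f):
            segments[video_id].append((float(start), float(end), label))
    return segments

def per_condition_detection_rate(segments, detector):
    """Fraction of frames flagged as speech, broken out by condition.

    `detector(video_id, t)` stands in for any model that returns True when
    it detects speech at time t. For the speech labels this fraction is the
    true positive rate; for NO_SPEECH it is the false positive rate.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for video_id, segs in segments.items():
        for start, end, label in segs:
            t = start
            while t < end:
                totals[label] += 1
                if detector(video_id, t):
                    hits[label] += 1
                t += FRAME_SEC
    return {lab: hits[lab] / totals[lab] for lab in totals if totals[lab]}
```

A real system would typically score fixed-rate frames in batch rather than query a pointwise callable; the callable here just keeps the sketch self-contained while separating the per-condition bookkeeping from any particular model.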
