Supplementary Material: AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited algorithm evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, along with whether the speech is audible. The dataset contains about 3.65 million human-labeled frames spanning 38.5 hours. We also introduce a state-of-the-art approach for real-time active speaker detection and compare several variants. The evaluation demonstrates a significant gain from audio-visual modeling and from temporal integration over multiple frames.
