Supplementary Material: AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection
Cordelia Schmid | Arkadiusz Stopczynski | Andrew C. Gallagher | Sourish Chaudhuri | Caroline Pantofaru | Zhonghua Xi | Sharadh Ramaswamy | Joseph Roth | Ondrej Klejch | Radhika Marvin | Liat Kaver
[1] Ishwar K. Sethi,et al. Cross-Modal Analysis of Audio-Visual Programs for Speaker Detection , 2005, 2005 IEEE 7th Workshop on Multimedia Signal Processing.
[2] Daniel P. W. Ellis,et al. AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies , 2018, INTERSPEECH.
[3] J. Fleiss. Measuring nominal scale agreement among many raters , 1971.
[4] Sergey Levine,et al. End-to-End Training of Deep Visuomotor Policies , 2015, J. Mach. Learn. Res.
[5] Olivier Galibert,et al. The REPERE Corpus : a multimodal corpus for person recognition , 2012, LREC.
[6] Henrik Schulz,et al. Speaker diarization of broadcast news in Albayzin 2010 evaluation campaign , 2012, EURASIP J. Audio Speech Music. Process.
[7] Xavier Anguera Miró,et al. Robust speaker diarization for meetings: ICSI RT06s evaluation system , 2006, INTERSPEECH.
[8] Jingwen Dai,et al. Deep Multimodal Speaker Naming , 2015, ACM Multimedia.
[9] Malcolm Slaney,et al. Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers , 2017, ArXiv.
[10] Chuohao Yeo,et al. Visual speaker localization aided by acoustic models , 2009, MM '09.
[11] Jason Weston,et al. Curriculum learning , 2009, ICML '09.
[12] Guillaume Gravier,et al. The ester 2 evaluation campaign for the rich transcription of French radio broadcasts , 2009, INTERSPEECH.
[13] Andrew Owens,et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.
[14] Rainer Stiefelhagen,et al. Semi-supervised Learning with Constraints for Person Identification in Multimedia Data , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.
[15] Sudeep Sarkar,et al. Exploring Co-Occurence Between Speech and Body Movement for Audio-Guided Video Localization , 2008, IEEE Transactions on Circuits and Systems for Video Technology.
[16] Ben Taskar,et al. Movie/Script: Alignment and Parsing of Video and Text Transcription , 2008, ECCV.
[17] Sileye O. Ba,et al. Speech/Non-Speech Detection in Meetings from Automatically Extracted low Resolution Visual Features , 2010, ICASSP.
[18] Joon Son Chung,et al. Out of Time: Automated Lip Sync in the Wild , 2016, ACCV Workshops.
[19] Hugo Van hamme,et al. Who's Speaking?: Audio-Supervised Classification of Active Speakers in Video , 2015, ICMI.
[20] Hugo Van hamme,et al. Active speaker detection with audio-visual co-training , 2016, ICMI.
[21] Paul A. Viola,et al. Boosting-Based Multimodal Speaker Detection for Distributed Meeting Videos , 2008, IEEE Transactions on Multimedia.
[22] Alex Graves,et al. Automated Curriculum Learning for Neural Networks , 2017, ICML.
[23] Harriet J. Nock,et al. Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study , 2003, CIVR.
[24] Sergey Levine,et al. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection , 2016, Int. J. Robotics Res.
[25] J.N. Gowdy,et al. CUAVE: A new audio-visual database for multimodal human-computer interface research , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.
[26] Jonas Beskow,et al. Vision-based Active Speaker Detection in Multiparty Interaction , 2017.
[27] Gang Hua,et al. A convolutional neural network cascade for face detection , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Trevor Darrell,et al. Learning Joint Statistical Models for Audio-Visual Fusion and Segregation , 2000, NIPS.
[29] Radu Horaud,et al. Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[30] Andrew Zisserman,et al. Taking the bite out of automated naming of characters in TV video , 2009, Image Vis. Comput.
[31] Shmuel Peleg,et al. Visual Speech Enhancement , 2017, INTERSPEECH.
[32] Koichi Shinoda. Speaker adaptation techniques for automatic speech recognition , 2011.
[33] Jean Carletta,et al. The AMI meeting corpus , 2005.
[34] Larry S. Davis,et al. Look who's talking: speaker detection using video and audio correlation , 2000, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532).
[35] Carlos Busso,et al. Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection , 2017, INTERSPEECH.
[36] Malcolm Slaney,et al. FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks , 2000, NIPS.
[37] Tinne Tuytelaars,et al. Cross-Modal Supervision for Learning Active Speaker Detection in Video , 2016, ECCV.
[38] Cordelia Schmid,et al. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[39] Bo Chen,et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.
[40] Kevin Wilson,et al. Looking to listen at the cocktail party , 2018, ACM Trans. Graph.
[41] Trevor Darrell,et al. Visual speech recognition with loosely synchronized feature streams , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.
[42] Sudeep Sarkar,et al. Audio Segmentation and Speaker Localization in Meeting Videos , 2006, 18th International Conference on Pattern Recognition (ICPR'06).
[43] Joon Son Chung,et al. VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.
[44] Joon Son Chung,et al. The Conversation: Deep Audio-Visual Speech Enhancement , 2018, INTERSPEECH.