Cross Modal Video Representations for Weakly Supervised Active Speaker Localization
[1] Juan Leon Alcazar,et al. End-to-End Active Speaker Detection , 2022, ECCV.
[2] Fabian Caba Heilbron,et al. MovieCuts: A New Dataset and Benchmark for Cut Type Recognition , 2021, ECCV.
[3] Yuki M. Asano,et al. Self-supervised object detection from audio-visual correspondence , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Vicky S. Kalogeiton,et al. Face, Body, Voice: Video Person-Clustering with Multiple Modalities , 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).
[5] Shrikanth S. Narayanan,et al. Computational Media Intelligence: Human-Centered Machine Analysis of Media , 2021, Proceedings of the IEEE.
[6] Bernard Ghanem,et al. MAAS: Multi-modal Assignation for Active Speaker Detection , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[7] Vittorio Murino,et al. S-VVAD: Visual Voice Activity Detection by Motion Segmentation , 2021, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).
[8] Bin Wu,et al. A multimodal fake news detection model based on crossmodal attention residual and multichannel convolutional neural networks , 2021, Inf. Process. Manag.
[9] Runhao Zeng,et al. Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization , 2020, ACM Multimedia.
[10] Andrew Owens,et al. Self-Supervised Learning of Audio-Visual Objects from Video , 2020, ECCV.
[11] Chenliang Xu,et al. Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing , 2020, ECCV.
[12] Irene Kotsia,et al. RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[13] Bernard Ghanem,et al. Active Speakers in Context , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Yong Jae Lee,et al. Instance-Aware, Context-Focused, and Memory-Efficient Weakly Supervised Object Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Pavel Korshunov,et al. Pyannote.Audio: Neural Building Blocks for Speaker Diarization , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[16] Du Tran,et al. What Makes Training Multi-Modal Classification Networks Hard? , 2019, Computer Vision and Pattern Recognition.
[17] Arkadiusz Stopczynski,et al. Ava Active Speaker: An Audio-Visual Dataset for Active Speaker Detection , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[18] Shrikanth Narayanan,et al. Toward Visual Voice Activity Detection for Unconstrained Videos , 2019, 2019 IEEE International Conference on Image Processing (ICIP).
[19] Joon Son Chung. Naver at ActivityNet Challenge 2019 - Task B Active Speaker Detection (AVA) , 2019, ArXiv.
[20] Shrikanth Narayanan,et al. Robust Speech Activity Detection in Movie Audio: Data Resources and Experimental Evaluation , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[21] Chuang Gan,et al. The Sound of Motions , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[22] Chang Liu,et al. C-MIL: Continuation Multiple Instance Learning for Weakly Supervised Object Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Florian Metze,et al. A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[24] S. Shan,et al. Multi-Task Learning for Audio-Visual Active Speaker Detection , 2019.
[25] Naveen Kumar,et al. Multimodal Representation of Advertisements Using Segment-level Autoencoders , 2018, ICMI.
[26] Daniel P. W. Ellis,et al. AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies , 2018, INTERSPEECH.
[27] Andrew Owens,et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.
[28] Chuang Gan,et al. The Sound of Pixels , 2018, ECCV.
[29] Andrew Zisserman,et al. Objects that Sound , 2017, ECCV.
[30] Anirban Sarkar,et al. Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).
[31] Andrew Zisserman,et al. Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[32] Abhinav Gupta,et al. A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[33] Wenyu Liu,et al. Multiple Instance Detection Network with Online Instance Classifier Refinement , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34] Aren Jansen,et al. Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[35] Abhishek Das,et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).
[36] Joon Son Chung,et al. Out of Time: Automated Lip Sync in the Wild , 2016, ACCV Workshops.
[37] Hugo Van hamme,et al. Active speaker detection with audio-visual co-training , 2016, ICMI.
[38] Fabio Tozeto Ramos,et al. Simple online and realtime tracking , 2016, 2016 IEEE International Conference on Image Processing (ICIP).
[39] Bolei Zhou,et al. Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[40] Andrea Vedaldi,et al. Weakly Supervised Deep Detection Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[41] Dragomir Anguelov,et al. Self-taught object localization with deep networks , 2014, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).
[42] Hugo Van hamme,et al. Who's Speaking?: Audio-Supervised Classification of Active Speakers in Video , 2015, ICMI.
[43] Erica Klarreich,et al. Hello, my name is… , 2014, CACM.
[44] Benjamin Schrauwen,et al. Training and Analysing Deep Recurrent Neural Networks , 2013, NIPS.
[45] Claudia Freigang,et al. Crossmodal interactions and multisensory integration in the perception of audio-visual motion — A free-field study , 2012, Brain Research.
[46] Christopher D. Chambers,et al. Current perspectives and methods in studying neural mechanisms of multisensory interactions , 2012, Neuroscience & Biobehavioral Reviews.
[47] L. Shams,et al. Crossmodal influences on visual perception. , 2010, Physics of life reviews.
[48] Yoav Y. Schechner,et al. Harmony in Motion , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.
[49] Andrew Zisserman,et al. "Hello! My name is... Buffy" - Automatic Naming of Characters in TV Video , 2006, BMVC.
[50] Trevor Darrell,et al. Learning Joint Statistical Models for Audio-Visual Fusion and Segregation , 2000, NIPS.
[51] Javier R. Movellan,et al. Audio Vision: Using Audio-Visual Synchrony to Locate Sounds , 1999, NIPS.