Paper information - Intel Labs at ActivityNet Challenge 2022: SPELL for Long-Term Active Speaker Detection

In this report, we describe SPELL, a novel spatial-temporal graph learning framework for active speaker detection (ASD). First, each person in a video frame is encoded as a unique node for that frame. The nodes corresponding to each person across frames are connected to encode their temporal dynamics. Nodes within a frame are also connected to encode inter-person relationships. Thus, SPELL reduces ASD to a node classification task. Importantly, SPELL is able to reason over long temporal contexts for all nodes at low computational cost.
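The graph construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Detection` container, the `build_graph` function, and the `max_gap` temporal window are hypothetical names and parameters introduced only for illustration, and per-node audio-visual features and the classifier itself are omitted. The sketch assumes each detection already carries a frame index and a person identity.

```python
from collections import defaultdict
from typing import NamedTuple


class Detection(NamedTuple):
    """One detected person in one frame (hypothetical container)."""
    frame: int      # frame index
    person_id: int  # identity of the detected person


def build_graph(detections, max_gap=1):
    """Return (temporal_edges, spatial_edges) as sorted lists of node-index pairs.

    Node i corresponds to detections[i]. max_gap is an assumed window:
    two nodes of the same person are linked if their frames differ by
    at most max_gap.
    """
    by_person = defaultdict(list)
    by_frame = defaultdict(list)
    for node, det in enumerate(detections):
        by_person[det.person_id].append(node)
        by_frame[det.frame].append(node)

    # Temporal edges: same person across nearby frames.
    temporal = set()
    for nodes in by_person.values():
        nodes.sort(key=lambda n: detections[n].frame)
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                if detections[b].frame - detections[a].frame > max_gap:
                    break  # later nodes are even further away in time
                temporal.add((a, b))

    # Spatial edges: different people within the same frame.
    spatial = set()
    for nodes in by_frame.values():
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                spatial.add((a, b))

    return sorted(temporal), sorted(spatial)


detections = [
    Detection(frame=0, person_id=0),  # node 0
    Detection(frame=0, person_id=1),  # node 1
    Detection(frame=1, person_id=0),  # node 2
    Detection(frame=1, person_id=1),  # node 3
    Detection(frame=2, person_id=0),  # node 4
]
temporal_edges, spatial_edges = build_graph(detections)
print(temporal_edges)  # [(0, 2), (1, 3), (2, 4)]
print(spatial_edges)   # [(0, 1), (2, 3)]
```

With such a graph in place, deciding whether each person is speaking in each frame becomes a per-node binary label, which is the node classification view the abstract refers to. Enlarging the assumed `max_gap` lets one node connect to more distant frames of the same person.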

Subarna Tripathi | Sourya Roy | T. Guha | Somdeb Majumdar | Kyle Min
