Intel Labs at ActivityNet Challenge 2022: SPELL for Long-Term Active Speaker Detection

In this report, we describe SPELL, a novel spatial-temporal graph learning framework for active speaker detection (ASD). First, each person in a video frame is encoded as a unique node for that frame. The nodes corresponding to the same person across frames are connected to encode that person's temporal dynamics. Nodes within the same frame are also connected to encode inter-person relationships. SPELL thus reduces ASD to a node classification task. Importantly, SPELL can reason over long temporal contexts for all nodes at low computational cost.
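The graph construction described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the report's implementation: the function name `build_graph`, the `(frame_index, person_id)` input format, and the `max_frame_gap` parameter are all hypothetical, and the abstract does not specify the actual edge rules or node features.

```python
from itertools import combinations


def build_graph(detections, max_frame_gap=1):
    """Build an undirected graph from (frame_index, person_id) detections.

    Hypothetical sketch: one node per person per frame, temporal edges
    between nodes of the same person, spatial edges between nodes that
    share a frame. Returns (nodes, edges), where edges are pairs of
    node indices (i, j) with i < j.
    """
    nodes = sorted(set(detections))
    index = {det: i for i, det in enumerate(nodes)}
    edges = set()

    # Temporal edges: same person, frames at most max_frame_gap apart
    # (the gap threshold is an assumption made for this sketch).
    frames_by_person = {}
    for frame, person in nodes:
        frames_by_person.setdefault(person, []).append(frame)
    for person, frames in frames_by_person.items():
        for f1, f2 in combinations(frames, 2):
            if f2 - f1 <= max_frame_gap:
                edges.add((index[(f1, person)], index[(f2, person)]))

    # Spatial edges: different persons detected in the same frame.
    persons_by_frame = {}
    for frame, person in nodes:
        persons_by_frame.setdefault(frame, []).append(person)
    for frame, persons in persons_by_frame.items():
        for p1, p2 in combinations(persons, 2):
            edges.add((index[(frame, p1)], index[(frame, p2)]))

    return nodes, sorted(edges)
```

With nodes and edges in this form, ASD becomes a per-node binary classification problem (speaking or not speaking). A graph neural network would consume the edge list together with per-node audio-visual features, which this sketch leaves out.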
