Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

Egocentric gaze anticipation is a key building block for emerging Augmented Reality applications. Notably, gaze behavior during daily activities is driven by both visual cues and audio signals. Motivated by this observation, we introduce the first model that leverages both video and audio modalities for egocentric gaze anticipation. Specifically, we propose a Contrastive Spatial-Temporal Separable (CSTS) fusion approach that uses two separate modules to capture audio-visual correlations along the spatial and temporal dimensions, and applies a contrastive loss to the re-weighted audio-visual features from the fusion modules for representation learning. We conduct extensive ablation studies and analysis on two egocentric video datasets, Ego4D and Aria, to validate our model design, and demonstrate improvements over prior state-of-the-art methods. We further provide visualizations of the gaze anticipation results and additional insights into the learned audio-visual representations.
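
To make the fusion design concrete, the sketch below illustrates one plausible reading of the CSTS idea in PyTorch: a spatial cross-attention branch re-weights video patch tokens with frame-level audio, a separate temporal cross-attention branch fuses the two modalities over time, and an InfoNCE-style contrastive loss is applied to the re-weighted features. This is a minimal illustration rather than the authors' implementation; all module names, tensor shapes, and hyperparameters are assumptions made for the example.

```python
# Hypothetical sketch of spatial-temporal separable audio-visual fusion with a
# contrastive objective; shapes and module names are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttention(nn.Module):
    """One modality's tokens attend to the other modality's tokens."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, key_value):
        fused, _ = self.attn(query, key_value, key_value)
        return self.norm(query + fused)  # residual re-weighting of the query tokens


class CSTSFusionSketch(nn.Module):
    """Separate spatial and temporal audio-visual fusion branches."""
    def __init__(self, dim=256):
        super().__init__()
        self.spatial_fusion = CrossModalAttention(dim)   # fuses over spatial patches
        self.temporal_fusion = CrossModalAttention(dim)  # fuses over time steps
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / 0.07).log())

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, T, S, D) -- T frames, S spatial patches, D channels
        # audio_tokens: (B, T, D)    -- one audio embedding per frame
        B, T, S, D = video_tokens.shape

        # Spatial branch: within each frame, video patches attend to that frame's audio.
        v_sp = video_tokens.reshape(B * T, S, D)
        a_sp = audio_tokens.reshape(B * T, 1, D)
        spatial_fused = self.spatial_fusion(v_sp, a_sp).reshape(B, T, S, D)

        # Temporal branch: audio attends to spatially pooled video features over time.
        v_tm = video_tokens.mean(dim=2)                               # (B, T, D)
        temporal_fused = self.temporal_fusion(a_sp.reshape(B, T, D), v_tm)

        return spatial_fused, temporal_fused

    def contrastive_loss(self, video_feat, audio_feat):
        # Symmetric InfoNCE over the batch: matched audio/video clips are positives.
        v = F.normalize(video_feat, dim=-1)
        a = F.normalize(audio_feat, dim=-1)
        logits = self.logit_scale.exp() * v @ a.t()
        targets = torch.arange(v.size(0), device=v.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = CSTSFusionSketch(dim=256)
    video = torch.randn(2, 8, 49, 256)   # 2 clips, 8 frames, 7x7 patch grid
    audio = torch.randn(2, 8, 256)
    sp, tm = model(video, audio)
    loss = model.contrastive_loss(sp.mean(dim=(1, 2)), tm.mean(dim=1))
    print(sp.shape, tm.shape, loss.item())
```

In this reading, keeping the spatial and temporal branches separable avoids mixing the two kinds of audio-visual correlation in a single joint attention, while the contrastive term encourages the re-weighted video and audio features of the same clip to align.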
