AViNet: Diving Deep into Audio-Visual Saliency Prediction

We propose AViNet, an architecture for audiovisual saliency prediction. AViNet is a fully convolutional encoder-decoder architecture. The encoder combines visual features learned for action recognition with audio embeddings learned by an aural network trained to classify objects and scenes. The decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining hierarchical features. The overall architecture is conceptually simple, causal, and runs in real time (60 fps). AViNet outperforms the state of the art on ten datasets (seven audiovisual and three visual-only) while surpassing human performance on the CC, SIM, and AUC metrics for the AVE dataset. Visual features account for most of the saliency on existing datasets, with audio contributing only minor gains, except in specific contexts such as social events. Our work therefore motivates the need to curate saliency datasets reflective of real life, where the visual and aural modalities complementarily drive saliency. Our code and pre-trained models are available at this https URL
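
Below is a minimal, hypothetical PyTorch sketch of the kind of audiovisual encoder-decoder fusion the abstract describes: a 3D-convolutional visual encoder, an audio embedding branch broadcast over the spatio-temporal feature grid, and a decoder that uses 3D convolutions plus trilinear interpolation to produce a saliency map. All module names, layer choices, and sizes here (AViNetSketch, vis_channels, aud_dim, the toy encoders) are illustrative assumptions; the actual model uses pretrained action-recognition and SoundNet-style backbones whose details are not given in this abstract.

```python
import torch
import torch.nn as nn


class AViNetSketch(nn.Module):
    """Hypothetical sketch of an AViNet-style audiovisual saliency model.

    The real backbones (a pretrained action-recognition video encoder and
    a SoundNet-like audio network) are replaced by small stand-ins here.
    """

    def __init__(self, vis_channels=256, aud_dim=128):
        super().__init__()
        # Stand-in for a 3D-CNN visual encoder producing a
        # spatio-temporal feature volume from an RGB clip.
        self.visual_encoder = nn.Sequential(
            nn.Conv3d(3, vis_channels, kernel_size=3,
                      stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        # Stand-in for an audio embedding network (SoundNet-like),
        # reducing a raw waveform to a single embedding vector.
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(1, aud_dim, kernel_size=9, stride=4),
            nn.AdaptiveAvgPool1d(1),
        )
        # Decoder: fuse audio into the visual volume, then 3D convs.
        self.decoder = nn.Sequential(
            nn.Conv3d(vis_channels + aud_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, 1, kernel_size=1),
        )

    def forward(self, video, audio):
        # video: (B, 3, T, H, W); audio: (B, 1, S) raw waveform.
        v = self.visual_encoder(video)             # (B, Cv, T, H/2, W/2)
        a = self.audio_encoder(audio).squeeze(-1)  # (B, Ca)
        # Broadcast the audio embedding over the spatio-temporal grid.
        a = a[:, :, None, None, None].expand(-1, -1, *v.shape[2:])
        x = self.decoder(torch.cat([v, a], dim=1))  # (B, 1, T, H/2, W/2)
        # Trilinear interpolation back to input resolution; a causal
        # model would read the map for the newest frame from the final
        # time slice.
        x = nn.functional.interpolate(
            x, size=(video.shape[2], video.shape[3], video.shape[4]),
            mode="trilinear", align_corners=False)
        return torch.sigmoid(x)


if __name__ == "__main__":
    model = AViNetSketch()
    sal = model(torch.randn(1, 3, 8, 64, 64), torch.randn(1, 1, 16000))
    print(sal.shape)  # torch.Size([1, 1, 8, 64, 64])
```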
