ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

We propose ViNet, a fully convolutional encoder-decoder architecture for audio-visual saliency prediction. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real time (60 fps). ViNet does not use audio as input, yet it outperforms state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset; to our knowledge, it is the first model to do so. We also explore a variation of the ViNet architecture that augments the decoder with audio features. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and produces the same output irrespective of the input. Interestingly, we observe similar behaviour in the previous state-of-the-art models [1] for audio-visual saliency prediction. Our findings contrast with previous works on deep-learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations that incorporate audio more effectively. The code and pre-trained models are available at https://github.com/samyak0210/ViNet.
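The decoder design described above (trilinear upsampling followed by 3D convolutions that fuse features from multiple encoder hierarchies) can be sketched roughly as follows. This is a minimal illustrative block, not the paper's released implementation; the class name, channel sizes, and feature shapes are hypothetical.

```python
# Hedged sketch of a ViNet-style decoder stage: upsample deep spatio-temporal
# features with trilinear interpolation, then fuse them with a skip connection
# from an earlier encoder hierarchy via a 3D convolution. Names and shapes are
# illustrative assumptions, not taken from the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # 3D conv mixes the upsampled features with the skip-connection features
        self.conv = nn.Conv3d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, skip):
        # Trilinear interpolation up to the skip features' (T, H, W) resolution
        x = F.interpolate(x, size=skip.shape[2:], mode='trilinear',
                          align_corners=False)
        x = torch.cat([x, skip], dim=1)  # channel-wise fusion of hierarchies
        return torch.relu(self.conv(x))

# Toy shapes: deep features with 192 channels at (4, 7, 12), and a
# higher-resolution skip connection with 96 channels at (8, 14, 24).
deep = torch.randn(1, 192, 4, 7, 12)
skip = torch.randn(1, 96, 8, 14, 24)
out = DecoderBlock(192, 96, 64)(deep, skip)
print(out.shape)  # torch.Size([1, 64, 8, 14, 24])
```

Stacking several such blocks, each consuming a progressively shallower encoder hierarchy, yields a full-resolution saliency map from causal video features alone.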

[1] Jorge Dias et al., Attentional Mechanisms for Socially Interactive Robots – A Survey, 2014, IEEE Transactions on Autonomous Mental Development.

[2] Santanu Chaudhury et al., Visual saliency guided video compression algorithm, 2013, Signal Process. Image Commun.

[3] P. Maragos et al., STAViS: Spatio-Temporal AudioVisual Saliency Network, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Petros Maragos et al., SUSiNet: See, Understand and Summarize It, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[5] Mubarak Shah et al., Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[6] A. Coutrot et al., How saliency, faces, and sound influence gaze in dynamic social scenes, 2014, Journal of Vision.

[7] Kyle Min et al., TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[8] Aykut Erdem et al., Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction, 2016, IEEE Transactions on Multimedia.

[9] Xiongkuo Min et al., A Multimodal Saliency Model for Videos With High Audio-Visual Correspondence, 2020, IEEE Transactions on Image Processing.

[10] Thomas Brox et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015, MICCAI.

[11] Hugo Larochelle et al., Recurrent Mixture Density Network for Spatiotemporal Visual Attention, 2016, ICLR.

[12] Qingshan Liu et al., Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network, 2020, Pattern Recognit.

[13] Antoine Coutrot et al., Influence of soundtrack on eye movements during video exploration, 2012.

[14] Sasa Bodiroza et al., Evaluating the Effect of Saliency Detection and Attention Manipulation in Human-Robot Interaction, 2013, Int. J. Soc. Robotics.

[15] John M. Henderson et al., Clustering of Gaze During Dynamic Scene Viewing is Predicted by Motion, 2011, Cognitive Computation.

[16] Rainer Stiefelhagen et al., Multimodal saliency-based attention for object-based scene analysis, 2011, 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[17] Shanmuganathan Raman et al., Facial Expression Recognition Using Visual Saliency and Deep Learning, 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[18] Jan Theeuwes et al., Pip and pop: nonspatial auditory signals improve spatial visual search, 2008, Journal of Experimental Psychology: Human Perception and Performance.

[19] E. Van der Burg et al., Audiovisual events capture attention: evidence from temporal order judgments, 2008, Journal of Vision.

[20] Garrison W. Cottrell et al., Visual saliency model for robot cameras, 2008, 2008 IEEE International Conference on Robotics and Automation.

[21] H. McGurk et al., Hearing lips and seeing voices, 1976, Nature.

[22] Federica Proietto Salanitri et al., Video Saliency Detection with Domain Adaptation using Hierarchical Gradient Reversal Layers, 2020, ArXiv.

[23] Ali Borji et al., Revisiting Video Saliency: A Large-Scale Benchmark and a New Model, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24] Esa Rahtu et al., DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction, 2019.

[25] Song Wang et al., SalSAC: A Video Saliency Prediction Model with Shuffled Attentions and Correlation-Based ConvLSTM, 2020, AAAI.

[26] Nanning Zheng et al., Visual Saliency Based Object Tracking, 2009, ACCV.

[27] Vineet Gandhi et al., GAZED – Gaze-guided Cinematic Editing of Wide-Angle Monocular Video Recordings, 2020, CHI.

[28] Chen Sun et al., Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification, 2017, ECCV.

[29] Andrew Zisserman et al., Look, Listen and Learn, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[30] Alexandre Bernardino et al., Multimodal saliency-based bottom-up attention a framework for the humanoid robot iCub, 2008, 2008 IEEE International Conference on Robotics and Automation.

[31] Frédo Durand et al., What Do Different Evaluation Metrics Tell Us About Saliency Models?, 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32] Qi Zhao et al., SALICON: Saliency in Context, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Chenliang Xu et al., Audio-Visual Event Localization in Unconstrained Videos, 2018, ECCV.

[34] Ivan V. Bajic et al., Saliency-Aware Video Compression, 2014, IEEE Transactions on Image Processing.

[35] Noel E. O'Connor et al., Simple vs complex temporal recurrences for video saliency prediction, 2019, BMVC.

[36] Mohan S. Kankanhalli et al., Static saliency vs. dynamic saliency: a comparative study, 2013, ACM Multimedia.

[37] Antoine Coutrot et al., Multimodal Saliency Models for Videos, 2016.

[38] Mohan S. Kankanhalli et al., Audio Matters in Visual Attention, 2014, IEEE Transactions on Circuits and Systems for Video Technology.

[39] Xiongkuo Min et al., Fixation prediction through multimodal analysis, 2015, 2015 Visual Communications and Image Processing (VCIP).

[40] Andrew Owens et al., Audio-Visual Scene Analysis with Self-Supervised Multisensory Features, 2018, ECCV.

[41] Guillermo Sapiro et al., SalGaze: Personalizing Gaze Estimation using Visual Saliency, 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[42] C. Schmid et al., Actions in context, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[43] Hanqiu Sun et al., Video Saliency Prediction Using Spatiotemporal Residual Attentive Networks, 2020, IEEE Transactions on Image Processing.

[44] Antonio Torralba et al., SoundNet: Learning Sound Representations from Unlabeled Video, 2016, NIPS.

[45] C. Spence et al., Crossmodal binding: Evaluating the "unity assumption" using audiovisual speech stimuli, 2007, Perception & Psychophysics.

[46] Tae-Hyun Oh et al., Learning to Localize Sound Source in Visual Scenes, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47] Petros Maragos et al., A perceptually based spatio-temporal computational framework for visual saliency estimation, 2015, Signal Process. Image Commun.

[48] Andrew Zisserman et al., Objects that Sound, 2017, ECCV.

[49] Luc Van Gool et al., Creating Summaries from User Videos, 2014, ECCV.

[50] Wenguan Wang et al., Deep Visual Attention Prediction, 2017, IEEE Transactions on Image Processing.

[51] Haibin Ling et al., Revisiting Video Saliency Prediction in the Deep Learning Era, 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52] Zulin Wang et al., Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM, 2017, ECCV.

[53] Vineet Gandhi et al., Tidying Deep Saliency Prediction Architectures, 2020, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).