Paper Information - AViNet: Diving Deep into Audio-Visual Saliency Prediction

We propose the AViNet architecture for audiovisual saliency prediction. AViNet is a fully convolutional encoder-decoder architecture. The encoder combines visual features learned for action recognition with audio embeddings learned via an aural network designed to classify objects and scenes. The decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining hierarchical features. The overall architecture is conceptually simple, causal, and runs in real time (60 fps). AViNet outperforms the state of the art on ten datasets (seven audiovisual and three visual-only) while surpassing human performance on the CC, SIM, and AUC metrics for the AVE dataset. Visual features account for most of the saliency on existing datasets, with audio contributing only minor gains except in specific contexts such as social events. Our work therefore motivates the need to curate saliency datasets reflective of real life, where the visual and aural modalities complementarily drive saliency. Our code and pre-trained models are available at this https URL
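The abstract reports results on the CC and SIM metrics without defining them. As a rough illustration, the sketch below computes both for a pair of small saliency maps using the definitions commonly used in saliency evaluation (Pearson correlation for CC, histogram intersection of sum-normalized maps for SIM). It is not taken from the authors' evaluation code; the function names `cc` and `sim`, the nested-list map format, and the handling of degenerate maps are assumptions made for this example.

```python
import math


def _flatten(saliency_map):
    """Flatten a 2D map (a list of rows) into a flat list of floats."""
    return [float(v) for row in saliency_map for v in row]


def cc(pred, gt):
    """Pearson linear correlation coefficient (CC) between two maps."""
    p, g = _flatten(pred), _flatten(gt)
    mean_p = sum(p) / len(p)
    mean_g = sum(g) / len(g)
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_g = sum((b - mean_g) ** 2 for b in g)
    denom = math.sqrt(var_p * var_g)
    # Assumption: a constant map (zero variance) is scored as 0 correlation.
    return cov / denom if denom > 0 else 0.0


def sim(pred, gt):
    """Similarity (SIM): sum of element-wise minima after each map is
    normalized to sum to 1. Assumes non-negative map values."""
    p, g = _flatten(pred), _flatten(gt)
    sum_p, sum_g = sum(p), sum(g)
    # Assumption: an all-zero map is scored as 0 similarity.
    if sum_p <= 0 or sum_g <= 0:
        return 0.0
    return sum(min(a / sum_p, b / sum_g) for a, b in zip(p, g))


if __name__ == "__main__":
    predicted = [[0.0, 1.0], [2.0, 3.0]]
    ground_truth = [[0.0, 2.0], [4.0, 6.0]]  # same shape, scaled by 2
    print("CC :", cc(predicted, ground_truth))
    print("SIM:", sim(predicted, ground_truth))
```

Both metrics are invariant to a positive rescaling of either map, so a prediction that matches the ground truth up to a scale factor scores close to the maximum of 1 on each. CC ranges over [-1, 1] and SIM over [0, 1].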
