Audiovisual saliency prediction via deep learning

Abstract Neuroscience studies verify that synchronized audiovisual stimuli elicit a stronger visual-perception response than either stimulus alone. A substantial body of research shows that audio signals affect human gaze behavior when viewing natural video scenes. In this paper, we therefore propose a multi-sensory framework that combines audio and visual signals for video saliency prediction. It comprises four modules: auditory feature extraction, visual feature extraction, semantic interaction between the auditory and visual features, and feature fusion. Taking audio and visual signals as inputs, we present a deep-learning network architecture that carries out the tasks of these four modules. It is an end-to-end architecture in which the semantics learned from the audio and visual stimuli interact. Numerical and visual results show that our method achieves a significant improvement over eleven recent saliency models that disregard audio stimuli, even though some of them are state-of-the-art deep learning models.
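To make the four-module pipeline concrete, the following is a minimal PyTorch sketch of one plausible realization. The branch designs, the sigmoid-gating form of the audio-visual interaction, and all layer sizes are illustrative assumptions for exposition, not the paper's actual architecture.

```python
# Minimal sketch of the four-module audiovisual saliency pipeline.
# All module choices and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class AudioVisualSaliencyNet(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # (1) Auditory feature extraction: convs over a log-mel spectrogram,
        #     pooled to a single audio embedding per clip.
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # (2) Visual feature extraction: convs over a video frame.
        self.visual_branch = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # (3) Semantic interaction: the audio embedding gates the visual
        #     feature map channel-wise (one plausible cross-modal coupling).
        self.gate = nn.Linear(feat_dim, feat_dim)
        # (4) Feature fusion: merge gated and original visual features,
        #     then predict a single-channel saliency map.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.Sigmoid(),
        )

    def forward(self, frame, spectrogram):
        a = self.audio_branch(spectrogram)                 # (B, feat_dim)
        v = self.visual_branch(frame)                      # (B, feat_dim, H, W)
        g = torch.sigmoid(self.gate(a))[..., None, None]   # (B, feat_dim, 1, 1)
        interacted = v * g                                 # audio-modulated visual features
        return self.fuse(torch.cat([v, interacted], dim=1))  # (B, 1, H, W)

# Usage on dummy data:
net = AudioVisualSaliencyNet()
frame = torch.randn(2, 3, 64, 64)   # batch of RGB frames
spec = torch.randn(2, 1, 96, 64)    # batch of log-mel spectrograms
saliency = net(frame, spec)         # (2, 1, 64, 64) saliency maps
```

Because the gating and fusion are plain differentiable layers, the whole pipeline can be trained end-to-end against ground-truth fixation maps, which matches the end-to-end property claimed in the abstract.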
