Noise-Aware Video Saliency Prediction

We tackle the problem of predicting saliency maps for videos of dynamic scenes. We note that the accuracy of the maps reconstructed from the gaze data of a fixed number of observers varies across frames, as it depends on the content of the scene. This issue is particularly pressing when only a limited number of observers is available. In such cases, directly minimizing the discrepancy between the predicted and measured saliency maps, as traditional deep-learning methods do, results in overfitting to the noisy data. We propose a noise-aware training (NAT) paradigm that quantifies and accounts for the uncertainty arising from frame-specific gaze-data inaccuracy. Experiments across different models, loss functions, and datasets show that NAT is especially advantageous when limited training data is available. We also introduce a video-game-based saliency dataset with rich temporal semantics and multiple gaze attractors per frame. The dataset and source code are available at https://github.com/NVlabs/NAT-saliency.
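To make the training objective concrete, below is a minimal PyTorch sketch of the idea as stated in the abstract: instead of driving the prediction loss to zero, the objective penalizes its deviation from a per-frame noise floor. The function names (`kl_loss`, `noise_aware_loss`) and the `noise_floor` input are our illustrative assumptions, not the released implementation; see the repository linked above for the authors' code.

```python
import torch

def kl_loss(pred, measured, eps=1e-8):
    # Per-frame KL divergence between predicted and measured saliency maps.
    # pred, measured: (B, H, W) tensors; each map is renormalized so it
    # sums to one and can be treated as a spatial probability distribution.
    p = pred.flatten(1) + eps
    q = measured.flatten(1) + eps
    p = p / p.sum(dim=1, keepdim=True)
    q = q / q.sum(dim=1, keepdim=True)
    return (q * (q / p).log()).sum(dim=1)  # shape (B,)

def noise_aware_loss(pred, measured, noise_floor):
    # Rather than minimizing the discrepancy to zero (which fits the
    # observer noise), penalize only the deviation of each frame's loss
    # from `noise_floor`: an estimate of the loss expected between the
    # measured map and the unknown true map for that frame. How this
    # floor is estimated is an assumption here (e.g., by comparing maps
    # built from disjoint subsets of the available observers).
    per_frame = kl_loss(pred, measured)            # (B,)
    return (per_frame - noise_floor).abs().mean()  # scalar training loss
```

In this sketch, `noise_floor` would be precomputed once per training frame from the gaze data, which is one plausible reading of the frame-specific uncertainty the abstract refers to.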
