Predicting Salient Face in Multiple-Face Videos

Although the recent success of convolutional neural networks (CNNs) has advanced the state of the art in saliency prediction for static images, few works have addressed the problem of predicting attention in videos. Meanwhile, we find that the attention of different subjects consistently focuses on a single face in each frame of videos involving multiple faces. Therefore, in this paper we propose a novel deep learning (DL) based method to predict the salient face in multiple-face videos, which is capable of learning both the features and the transition of salient faces across video frames. In particular, we first learn a CNN to locate the salient face in each frame. Taking the CNN features as input, we then develop a multiple-stream long short-term memory (M-LSTM) network to predict the temporal transition of salient faces in video sequences. To evaluate our DL-based method, we build a new eye-tracking database of multiple-face videos. The experimental results show that our method outperforms the prior state-of-the-art methods in predicting visual attention on faces in multiple-face videos.
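
The two-stage idea described above (per-frame face features followed by a recurrent model of how saliency moves between faces) can be illustrated with a minimal sketch. This is not the M-LSTM from the paper, whose details the abstract does not give. It assumes one LSTM stream per face with weights shared across streams, a fixed number of faces in every frame, randomly initialized (untrained) weights, and a softmax over faces to produce per-frame saliency weights. The names `TinyLSTMCell` and `predict_face_saliency` are hypothetical, and the input feature vectors stand in for CNN features.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyLSTMCell:
    """Standard LSTM cell in plain Python (illustrative, untrained)."""

    def __init__(self, n_in, hidden, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        # 4 * hidden rows: input gate, forget gate, output gate, candidate.
        # Each row covers the input, the previous hidden state, and a bias.
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + hidden + 1)]
                  for _ in range(4 * hidden)]

    def step(self, x, h, c):
        z = x + h + [1.0]  # concatenate input, previous hidden state, bias term
        pre = [sum(wi * zi for wi, zi in zip(row, z)) for row in self.w]
        n = self.hidden
        i = [sigmoid(v) for v in pre[0:n]]
        f = [sigmoid(v) for v in pre[n:2 * n]]
        o = [sigmoid(v) for v in pre[2 * n:3 * n]]
        g = [math.tanh(v) for v in pre[3 * n:4 * n]]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def predict_face_saliency(features, hidden=4, seed=0):
    """features[t][k] is the feature vector of face k in frame t.

    Returns weights[t][k]; the weights sum to 1 over faces in each frame.
    """
    n_faces = len(features[0])
    n_in = len(features[0][0])
    cell = TinyLSTMCell(n_in, hidden, seed)  # shared across streams (assumption)
    rng = random.Random(seed + 1)
    readout = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
    # One hidden state and one cell state per face stream.
    h = [[0.0] * hidden for _ in range(n_faces)]
    c = [[0.0] * hidden for _ in range(n_faces)]
    weights = []
    for frame in features:
        scores = []
        for k, x in enumerate(frame):
            h[k], c[k] = cell.step(list(x), h[k], c[k])
            scores.append(sum(r * v for r, v in zip(readout, h[k])))
        # Normalize across faces so each frame yields a distribution.
        weights.append(softmax(scores))
    return weights


if __name__ == "__main__":
    # Three frames, two faces, three-dimensional placeholder features.
    demo = [[[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]] for _ in range(3)]
    for t, row in enumerate(predict_face_saliency(demo)):
        print(t, row)
```

Normalizing across faces reflects the observation that attention tends to settle on a single face per frame; in a trained model the largest weight in each frame would mark the predicted salient face.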
