A Sparse Coding Framework for Gaze Prediction in Egocentric Video

Predicting human gaze is important for efficiently processing and understanding the large amount of visual information that arrives from a first-person perspective (i.e., egocentric vision). However, although people gaze continuously in noisy real-world environments, most existing gaze prediction methods rely mainly on image saliency, which is sensitive to such noise. To address this issue, we propose a sparse coding-based saliency detection method for gaze prediction. Our model uses a cost function with the ℓ0 norm as a sparsity constraint, which controls the extent of visual saliency in response to the content of the egocentric view in an intuitive and consistent way. Moreover, we use canonical correlation analysis (CCA) to combine different types of features, reducing both noise and computational complexity. We also exploit the temporal continuity of image frames when defining saliency. Experiments on a real-world gaze dataset show that the proposed approach outperforms state-of-the-art gaze prediction algorithms on egocentric video.
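As a rough illustration of the pipeline summarized above, the following is a minimal sketch, not the authors' implementation: two hypothetical per-patch feature sets are fused with CCA, and each patch is scored by the error of reconstructing it from the remaining patches under an ℓ0 sparsity budget, solved here with orthogonal matching pursuit. The function names, feature dimensions, and the choice of scikit-learn solvers are assumptions for illustration only; the temporal-continuity term mentioned in the abstract is omitted.

```python
# Minimal sketch (not the paper's code) of sparse-coding saliency with CCA feature fusion.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import OrthogonalMatchingPursuit


def fuse_features_cca(feat_a, feat_b, n_components=8):
    """Project two per-patch feature matrices into a shared CCA space and
    concatenate the projections (noise and dimensionality reduction)."""
    cca = CCA(n_components=n_components)
    proj_a, proj_b = cca.fit_transform(feat_a, feat_b)
    return np.hstack([proj_a, proj_b])


def sparse_coding_saliency(features, n_nonzero=5):
    """Score each patch by the error of reconstructing it from all other
    patches under an l0 budget of n_nonzero atoms (via OMP)."""
    n_patches = features.shape[0]
    saliency = np.zeros(n_patches)
    for i in range(n_patches):
        target = features[i]
        dictionary = np.delete(features, i, axis=0).T  # columns = the other patches
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero)
        omp.fit(dictionary, target)
        residual = target - (dictionary @ omp.coef_ + omp.intercept_)
        saliency[i] = np.linalg.norm(residual)  # hard to reconstruct => salient
    return saliency / (saliency.max() + 1e-8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    color_feat = rng.normal(size=(64, 32))   # hypothetical colour features per patch
    motion_feat = rng.normal(size=(64, 16))  # hypothetical motion features per patch
    fused = fuse_features_cca(color_feat, motion_feat)
    print(sparse_coding_saliency(fused))
```

In this sketch, a patch that cannot be sparsely reconstructed from its context yields a large residual and is therefore treated as salient, which mirrors the reconstruction-error view of saliency; the ℓ0 budget plays the role of the sparsity constraint described in the abstract.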
