Compact CNN for indexing egocentric videos

While egocentric video is becoming increasingly popular, browsing it is very difficult. In this paper we present a compact 3D Convolutional Neural Network (CNN) architecture for long-term activity recognition in egocentric videos. Recognizing long-term activities enables us to temporally segment (index) long and unstructured egocentric videos. Existing methods for this task rely on hand-tuned features derived from visible objects, the location of the hands, and optical flow. Given a sparse optical flow volume as input, our CNN classifies the camera wearer's activity. We obtain a classification accuracy of 89%, outperforming the current state of the art by 19%. Additional evaluation is performed on an extended egocentric video dataset, classifying twice as many categories as the current state of the art. Furthermore, our CNN recognizes whether a video is egocentric or not with 99.2% accuracy, a 24% improvement over the current state of the art. To better understand what the network actually learns, we propose a novel visualization of CNN kernels as flow fields.
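
To make the abstract's input/output contract concrete, below is a minimal PyTorch sketch of the kind of network described: a small 3D CNN that consumes a sparse optical flow volume of shape (batch, 2, T, H, W), with two channels holding the horizontal and vertical flow components, and outputs activity class scores. The layer counts, kernel sizes, 16x16 flow grid, and 60-frame window are illustrative assumptions, not the paper's actual configuration; the quiver-plot helper is likewise only in the spirit of the kernel-as-flow-field visualization the abstract mentions.

    # Illustrative sketch only; not the paper's architecture.
    import torch
    import torch.nn as nn

    class CompactFlowCNN(nn.Module):
        """Compact 3D CNN over a stack of (u, v) optical flow fields.

        Input: (batch, 2, T, H, W) -- 2 flow channels, T frames,
        H x W sparse flow grid. All sizes here are assumptions.
        """
        def __init__(self, num_classes: int = 7):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(2, 16, kernel_size=(5, 3, 3), padding=(2, 1, 1)),
                nn.ReLU(inplace=True),
                nn.MaxPool3d((2, 2, 2)),
                nn.Conv3d(16, 32, kernel_size=(5, 3, 3), padding=(2, 1, 1)),
                nn.ReLU(inplace=True),
                nn.MaxPool3d((2, 2, 2)),
            )
            self.classifier = nn.Sequential(
                nn.AdaptiveAvgPool3d(1),  # global pooling keeps the model compact
                nn.Flatten(),
                nn.Linear(32, num_classes),
            )

        def forward(self, flow_volume: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.features(flow_volume))

    def kernel_as_flow_field(kernel: torch.Tensor, t: int = 0) -> None:
        """Plot one learned 3D kernel's (u, v) channels at time slice t as a
        flow field (quiver plot). kernel shape: (2, kT, kH, kW)."""
        import matplotlib.pyplot as plt
        u = kernel[0, t].detach().numpy()
        v = kernel[1, t].detach().numpy()
        plt.quiver(u, v)
        plt.show()

    # Usage: a batch of 4 clips, each with 60 flow fields on a 16x16 grid.
    if __name__ == "__main__":
        model = CompactFlowCNN(num_classes=7)
        x = torch.randn(4, 2, 60, 16, 16)
        logits = model(x)              # -> (4, 7) class scores
        kernel_as_flow_field(model.features[0].weight[0])

Global average pooling in place of large fully connected layers is one common way to keep such a network's parameter count small, which matters when the input is a low-resolution flow grid rather than raw frames.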
