SEMBED: Semantic Embedding of Egocentric Action Videos

We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph in order to estimate a probability distribution over its potential semantic labels. When object interactions are annotated with an unbounded choice of verbs, we embrace the wealth and ambiguity of these labels by capturing their semantic relationships as well as the visual similarities of the videos over motion and appearance features. We show how SEMBED can interpret a challenging dataset of 1225 freely annotated egocentric videos, outperforming SVM classification by more than 5%.
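To make the idea concrete, the sketch below illustrates (in a heavily simplified form, not the authors' exact formulation) how a label distribution for a query video could be estimated by combining visual nearest neighbours with a semantic similarity between free-form verb labels. The functions `visual_similarity` and `semantic_similarity`, the toy similarity table, and all feature vectors are hypothetical placeholders; the paper builds on motion/appearance descriptors and semantic relationships between verbs (e.g. WordNet-style).

```python
# Minimal sketch (assumptions throughout): estimate a probability distribution
# over verb labels for a query video by embedding it among visually similar
# annotated videos and letting each neighbour also vote for semantically
# related labels.
import numpy as np


def visual_similarity(f1, f2):
    """Cosine similarity between two video feature vectors (stand-in for real
    motion/appearance descriptors such as dense-trajectory encodings)."""
    return float(np.dot(f1, f2) /
                 (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-8))


# Toy semantic similarity between free-form verb labels. A real system could
# derive this from WordNet; here it is a hand-coded placeholder.
SEMANTIC_SIM = {
    ("open", "pull"): 0.6,
    ("open", "turn"): 0.3,
    ("pull", "turn"): 0.2,
}


def semantic_similarity(a, b):
    if a == b:
        return 1.0
    return SEMANTIC_SIM.get((a, b), SEMANTIC_SIM.get((b, a), 0.0))


def label_distribution(query_feat, train_feats, train_labels, k=3):
    """Connect the query to its k visually nearest training videos, then
    spread each neighbour's vote over semantically related labels."""
    sims = np.array([visual_similarity(query_feat, f) for f in train_feats])
    neighbours = np.argsort(-sims)[:k]

    labels = sorted(set(train_labels))
    scores = np.zeros(len(labels))
    for n in neighbours:
        for i, lab in enumerate(labels):
            # A neighbour annotated with train_labels[n] also supports labels
            # semantically close to it, weighted by its visual similarity.
            scores[i] += sims[n] * semantic_similarity(train_labels[n], lab)

    probs = scores / scores.sum()
    return dict(zip(labels, probs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_feats = rng.random((6, 16))
    train_labels = ["open", "open", "pull", "pull", "turn", "turn"]
    query = train_feats[0] + 0.05 * rng.random(16)
    print(label_distribution(query, train_feats, train_labels))
```

The output is a distribution rather than a single label, which is the point of the embedding: visually ambiguous interactions still concentrate probability mass on a set of semantically related verbs instead of being forced into one class.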
