Towards an Unequivocal Representation of Actions

This work introduces verb-only representations for actions and interactions; the problem of describing similar motions (e.g. 'open door', 'open cupboard'), and distinguish differing ones (e.g. 'open door' vs 'open bottle') using verb-only labels. Current approaches for action recognition neglect legitimate semantic ambiguities and class overlaps between verbs (Fig. 1), relying on the objects to disambiguate interactions. We deviate from single-verb labels and introduce a mapping between observations and multiple verb labels - in order to create an Unequivocal Representation of Actions. The new representation benefits from increased vocabulary and a soft assignment to an enriched space of verb labels. We learn these representations as multi-output regression, using a two-stream fusion CNN. The proposed approach outperforms conventional single-verb labels (also known as majority voting) on three egocentric datasets for both recognition and retrieval.

[1]  Jiaxuan Wang,et al.  HICO: A Benchmark for Recognizing Human-Object Interactions in Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  Dima Damen,et al.  You-Do, I-Learn: Discovering Task Relevant Objects and their Modes of Interaction from Multi-User Egocentric Video , 2014, BMVC.

[3]  Abhinav Gupta,et al.  What Actions are Needed for Understanding Human Actions in Videos? , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[4]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[5]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[6]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[7]  C. V. Jawahar,et al.  First Person Action Recognition Using Deep Learned Descriptors , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[9]  Dima Damen,et al.  SEMBED: Semantic Embedding of Egocentric Action Videos , 2016, ECCV Workshops.

[10]  Johannes Fürnkranz,et al.  Large-Scale Multi-label Text Classification - Revisiting Neural Networks , 2013, ECML/PKDD.

[11]  Wei Xu,et al.  CNN-RNN: A Unified Framework for Multi-label Image Classification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Hal Daumé,et al.  Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression , 2012, NIPS.

[14]  Pietro Perona,et al.  Describing Common Human Visual Actions in Images , 2015, BMVC.

[15]  James M. Rehg,et al.  Learning to Recognize Daily Actions Using Gaze , 2012, ECCV.

[16]  Frank Keller,et al.  Unsupervised Visual Sense Disambiguation for Verbs using Multimodal Embeddings , 2016, NAACL.

[17]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Bingbing Ni,et al.  HCP: A Flexible CNN Framework for Multi-Label Image Classification , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Dima Damen,et al.  Improving Classification by Improving Labelling: Introducing Probabilistic Multi-Label Object Interaction Recognition , 2017, ArXiv.

[20]  Larry S. Davis,et al.  Walking and talking: A bilinear approach to multi-label action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[21]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[22]  Jessica K. Hodgins,et al.  Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) Database , 2008 .

[23]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.