Paper Information - Towards an Unequivocal Representation of Actions

Towards an Unequivocal Representation of Actions

This work introduces verb-only representations for actions and interactions, addressing the problem of describing similar motions (e.g. 'open door', 'open cupboard') and distinguishing differing ones (e.g. 'open door' vs 'open bottle') using verb-only labels. Current approaches to action recognition neglect legitimate semantic ambiguities and class overlaps between verbs (Fig. 1), relying on objects to disambiguate interactions. We deviate from single-verb labels and introduce a mapping between observations and multiple verb labels in order to create an Unequivocal Representation of Actions. The new representation benefits from an increased vocabulary and a soft assignment to an enriched space of verb labels. We learn these representations as multi-output regression, using a two-stream fusion CNN. The proposed approach outperforms conventional single-verb labels (also known as majority voting) on three egocentric datasets for both recognition and retrieval.
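To make the contrast between a single-verb (majority-vote) label and a soft multi-verb label concrete, here is a minimal sketch. It assumes that each annotator may pick several verbs per video and that a verb's soft score is the fraction of annotators who chose it; the abstract does not specify how the scores are obtained, so this scoring rule, the vocabulary `VOCAB`, and the annotator responses are all hypothetical.

```python
# Illustrative sketch only: the vocabulary, the annotator responses and the
# scoring rule (fraction of annotators per verb) are assumptions, not the
# paper's actual labelling procedure.

VOCAB = ["open", "pull", "turn", "twist", "push"]


def soft_assignment(responses):
    """Return one score per verb: the fraction of annotators who chose it."""
    n = len(responses)
    return [sum(verb in r for r in responses) / n for verb in VOCAB]


def majority_vote(responses):
    """Return the single verb chosen by the most annotators.

    Ties are broken by vocabulary order.
    """
    scores = soft_assignment(responses)
    return VOCAB[scores.index(max(scores))]


# Each set holds the verbs one (hypothetical) annotator picked for a video.
open_door = [{"open", "pull"}, {"open", "pull"}, {"open"}, {"open", "turn"}]
open_bottle = [{"open", "twist"}, {"open", "turn", "twist"},
               {"open", "twist"}, {"open"}]

door_label = majority_vote(open_door)
bottle_label = majority_vote(open_bottle)
door_target = soft_assignment(open_door)
bottle_target = soft_assignment(open_bottle)

print(door_label, bottle_label)   # both collapse to the same single verb
print(door_target)                # soft scores over VOCAB for 'open door'
print(bottle_target)              # soft scores over VOCAB for 'open bottle'
```

Under these assumptions both videos receive the same majority-vote label, while their soft score vectors differ. Vectors of this kind could serve as targets for a multi-output regression model.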
