Improving Classification by Improving Labelling: Introducing Probabilistic Multi-Label Object Interaction Recognition

This work deviates from easy-to-define class boundaries for object interactions. For the task of object interaction recognition, often captured using an egocentric view, we show that semantic ambiguities in verbs and recognising sub-interactions along with concurrent interactions result in legitimate class overlaps (Figure 1). We thus aim to model the mapping between observations and interaction classes, as well as class overlaps, towards a probabilistic multi-label classifier that emulates human annotators. Given a video segment containing an object interaction, we model the probability for a verb, out of a list of possible verbs, to be used to annotate that interaction. The probability is learnt from crowdsourced annotations, and is tested on two public datasets, comprising 1405 video sequences for which we provide annotations on 90 verbs. We outperform conventional single-label classification by 11% and 6% on the two datasets respectively, and show that learning from annotation probabilities outperforms majority voting and enables discovery of co-occurring labels.
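The core idea is to train against the full distribution of verbs chosen by annotators rather than a single majority-vote label. A minimal sketch of that idea follows, assuming a simple linear softmax classifier over pre-computed video features; the feature dimensions, verb vocabulary size, and function names are illustrative and not the paper's implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the verb dimension.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def annotation_probabilities(annotations, num_verbs):
    """Turn crowdsourced verb choices for one video segment into a soft target.

    `annotations` is a list of verb indices picked by different annotators
    (hypothetical input format).
    """
    counts = np.bincount(annotations, minlength=num_verbs).astype(float)
    return counts / counts.sum()

def train_soft_label_classifier(X, Y, lr=0.1, epochs=200):
    """Linear softmax classifier trained with cross-entropy against
    soft (probabilistic) targets Y instead of one-hot majority votes."""
    n, d = X.shape
    num_verbs = Y.shape[1]
    W = np.zeros((d, num_verbs))
    b = np.zeros(num_verbs)
    for _ in range(epochs):
        P = softmax(X @ W + b)      # predicted verb distribution per segment
        grad = P - Y                # gradient of cross-entropy w.r.t. the logits
        W -= lr * X.T @ grad / n
        b -= lr * grad.mean(axis=0)
    return W, b

# Toy usage: 3 verbs, 2 video segments described by 4-D features.
X = np.array([[0.2, 1.0, 0.1, 0.0],
              [0.9, 0.0, 0.3, 0.7]])
# Annotator choices per segment (e.g. overlapping verbs such as "take"/"pick up").
Y = np.stack([annotation_probabilities([0, 0, 1], num_verbs=3),
              annotation_probabilities([2, 2, 2], num_verbs=3)])
W, b = train_soft_label_classifier(X, Y)
print(softmax(X @ W + b))  # predicted probability that each verb annotates each segment
```

Training against the distribution, rather than collapsing it to a majority vote, lets the classifier keep probability mass on semantically overlapping or co-occurring verbs instead of forcing a single winner per segment.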
