Zero-Shot Learning on Human-Object Interaction Recognition in Video

Recognition of human activities is an essential field in computer vision, and much of human activity consists of human-object interaction (HOI). Many successful works have addressed HOI recognition and achieved acceptable results, but they are fully supervised and require labeled training data for every HOI. The space of possible human-object interactions is huge, and enumerating and providing training data for all categories is costly and impractical. We tackle this problem by proposing an approach that scales human-object interaction recognition in video data through zero-shot learning. Our method recognizes a verb and an object from video and composes them into an HOI class. Recognizing verbs and objects instead of whole HOIs allows a new combination of verb and object to be identified as a new HOI class that the recognizer model has never seen. We introduce a neural network architecture that can process video data. The proposed model learns verbs and objects from the available training data, and at test time it detects verb-object pairs in a video and thereby identifies the HOI class. We evaluated our model on the recently introduced Charades dataset, which contains many HOI categories in videos. We show that our model can detect unseen HOI classes in addition to achieving acceptable recognition of seen classes, so the number of identifiable categories is significantly larger than the number of training classes.
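The compositional idea above can be illustrated with a minimal sketch: if the model outputs independent probability distributions over verbs and over objects, every verb-object pair can be scored as a product, including combinations never seen during training. The vocabularies, the seen-pair set, and the scoring rule below are illustrative assumptions, not the paper's actual architecture or training setup.

```python
# Hypothetical toy vocabularies; the real model learns these from video data.
VERBS = ["hold", "open", "throw"]
OBJECTS = ["book", "door", "pillow"]

# Verb-object pairs assumed to have appeared in training (illustrative only).
SEEN_PAIRS = {("hold", "book"), ("open", "door"), ("throw", "pillow")}


def predict_hoi(verb_probs, obj_probs):
    """Score every verb-object pair as the product of the independent
    verb and object probabilities, then return the best-scoring pair
    and whether that combination was seen during training."""
    _, verb, obj = max(
        (v * o, verb, obj)
        for verb, v in zip(VERBS, verb_probs)
        for obj, o in zip(OBJECTS, obj_probs)
    )
    return verb, obj, (verb, obj) in SEEN_PAIRS
```

Because the verb and object classifiers are trained separately, a confident "open" prediction paired with a confident "book" prediction yields the HOI "open book" even though that pair was never in the training set, e.g. `predict_hoi([0.1, 0.7, 0.2], [0.8, 0.1, 0.1])` returns `("open", "book", False)`.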
