Something-Else: Compositional Action Recognition With Spatial-Temporal Interaction Networks

Human action is naturally compositional: humans can easily recognize and perform actions with objects that are different from those used in training demonstrations. In this paper, we study the compositionality of action by looking into the dynamics of subject-object interactions. We propose a novel model which can explicitly reason about the geometric relations between constituent objects and an agent performing an action. To train our model, we collect dense object box annotations on the Something-Something dataset. We propose a novel compositional action recognition task where the training combinations of verbs and nouns do not overlap with the test set. The novel aspects of our model are applicable to activities with prominent object interaction dynamics and to objects which can be tracked using state-of-the-art approaches; for activities without clearly defined spatial object-agent interactions, we rely on baseline scene-level spatio-temporal representations. We show the effectiveness of our approach not only on the proposed compositional action recognition task but also in a few-shot compositional setting which requires the model to generalize across both object appearance and action category.

[1]  Joanna Materzynska,et al.  The Jester Dataset: A Large-Scale Video Dataset of Human Gestures , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[2]  Bolei Zhou,et al.  Reasoning About Human-Object Interactions Through Dual Attention Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[3]  Juan Carlos Niebles,et al.  Few-Shot Video Classification via Temporal Alignment , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Yu-Chiang Frank Wang,et al.  A Closer Look at Few-shot Classification , 2019, ICLR.

[5]  Subhransu Maji,et al.  Meta-Learning With Differentiable Convex Optimization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Kaiming He,et al.  Long-Term Feature Banks for Detailed Video Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Andrew Zisserman,et al.  Video Action Transformer Network , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Trevor Darrell,et al.  Spatio-Temporal Action Graph Networks , 2018, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[10]  Shuicheng Yan,et al.  Graph-Based Global Reasoning Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Chuang Gan,et al.  TSM: Temporal Shift Module for Efficient Video Understanding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Yin Li,et al.  Compositional Learning for Human Object Interaction , 2018, ECCV.

[13]  Li Fei-Fei,et al.  Neural Graph Matching Networks for Fewshot 3D Action Recognition , 2018, ECCV.

[14]  Cordelia Schmid,et al.  Actor-Centric Relation Network , 2018, ECCV.

[15]  Abhinav Gupta,et al.  Videos as Space-Time Region Graphs , 2018, ECCV.

[16]  Jessica B. Hamrick,et al.  Relational inductive biases, deep learning, and graph networks , 2018, ArXiv.

[17]  Dima Damen,et al.  Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , 2018, ArXiv.

[18]  Chen Sun,et al.  Rethinking Spatiotemporal Feature Learning For Video Understanding , 2017, ArXiv.

[19]  Yichen Wei,et al.  Relation Networks for Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Bolei Zhou,et al.  Temporal Relational Reasoning in Videos , 2017, ECCV.

[22]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Joan Bruna,et al.  Few-Shot Learning with Graph Neural Networks , 2017, ICLR.

[24]  Xiao Liu,et al.  Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification , 2017, ArXiv.

[25]  Martial Hebert,et al.  From Red Wine to Red Tomato: Composition with Context , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Susanne Westphal,et al.  The “Something Something” Video Database for Learning and Evaluating Visual Common Sense , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  Razvan Pascanu,et al.  A simple neural network module for relational reasoning , 2017, NIPS.

[28]  Razvan Pascanu,et al.  Visual Interaction Networks: Learning a Physics Simulator from Video , 2017, NIPS.

[29]  Cordelia Schmid,et al.  AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[32]  Li Fei-Fei,et al.  Inferring and Executing Programs for Visual Reasoning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[33]  Kaiming He,et al.  Detecting and Recognizing Human-Object Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Trevor Darrell,et al.  Learning to Reason: End-to-End Module Networks for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[35]  Richard S. Zemel,et al.  Prototypical Networks for Few-shot Learning , 2017, NIPS.

[36]  Sergey Levine,et al.  Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks , 2017, ICML.

[37]  Xin Pan,et al.  YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Serge J. Belongie,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Leonidas J. Guibas,et al.  Learning Shape Abstractions by Assembling Volumetric Primitives , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Razvan Pascanu,et al.  Interaction Networks for Learning about Objects, Relations and Physics , 2016, NIPS.

[41]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[43]  Hugo Larochelle,et al.  Optimization as a Model for Few-Shot Learning , 2016, ICLR.

[44]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[45]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[46]  Daan Wierstra,et al.  Meta-Learning with Memory-Augmented Neural Networks , 2016, ICML.

[47]  Oriol Vinyals,et al.  Matching Networks for One Shot Learning , 2016, NIPS.

[48]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[49]  Fabio Tozeto Ramos,et al.  Simple online and realtime tracking , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[50]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Silvio Savarese,et al.  Structural-RNN: Deep Learning on Spatio-Temporal Graphs , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Yi Yang,et al.  Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Dan Klein,et al.  Neural Module Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Michael S. Bernstein,et al.  Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[58]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[60]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[61]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[62]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[63]  Yong Jae Lee,et al.  Mid-level Features Improve Recognition of Interactive Activities , 2012 .

[64]  Kate Saenko,et al.  A combined pose, object, and feature model for action understanding , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[65]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[66]  Fei-Fei Li,et al.  Modeling mutual context of object and human pose in human-object interaction activities , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[67]  A. Gupta,et al.  Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[68]  David A. Forsyth,et al.  Searching for Complex Human Activities with No Visual Examples , 2008, International Journal of Computer Vision.

[69]  Alexei A. Efros,et al.  Using Multiple Segmentations to Discover Objects and their Extent in Image Collections , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[70]  Donald D. Hoffman,et al.  Parts of recognition , 1984, Cognition.

[71]  Harold W. Kuhn,et al.  The Hungarian method for the assignment problem , 1955, 50 Years of Integer Programming.

[72]  Song Han,et al.  Temporal Shift Module for Efficient Video Understanding , 2018, ArXiv.

[73]  R. E. Kalman,et al.  A New Approach to Linear Filtering and Prediction Problems , 2002 .

[74]  A New Approach to Linear Filtering and Prediction Problems , 2002 .

[75]  Ming Yang,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 3d Convolutional Neural Networks for Human Action Recognition , 2022 .