Interactive Fusion of Multi-level Features for Compositional Activity Recognition

To understand a complex action, multiple sources of information, including appearance, positional, and semantic features, need to be integrated. However, these features are difficult to fuse because they often differ significantly in modality and dimensionality. In this paper, we present a novel framework that accomplishes this goal through interactive fusion, namely, projecting features across different spaces and guiding the process with an auxiliary prediction task. Specifically, we implement the framework in three steps: positional-to-appearance feature extraction, semantic feature interaction, and semantic-to-positional prediction. We evaluate our approach on two action recognition datasets, Something-Something and Charades. Interactive fusion achieves consistent accuracy gains over off-the-shelf action recognition algorithms. In particular, on Something-Else, the compositional setting of Something-Something, interactive fusion yields a remarkable gain of 2.9% in top-1 accuracy.
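
Below is a minimal PyTorch-style sketch of how such a three-step pipeline could be wired together for a single frame. All module names and feature dimensions are illustrative assumptions, as are the specific choices of RoIAlign for positional-to-appearance extraction and self-attention for semantic interaction; this is a sketch of the general idea, not the paper's exact implementation.

```python
# Illustrative sketch of three-step interactive fusion (assumptions, not
# the paper's exact architecture): (1) positions crop appearance features,
# (2) object-level semantic features interact, (3) an auxiliary head
# predicts positions back from semantics to guide the fusion.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class InteractiveFusion(nn.Module):
    def __init__(self, app_dim=256, pos_dim=4, sem_dim=128, num_classes=174):
        super().__init__()
        # Step 1: project RoI-pooled appearance features to semantic space.
        self.app_proj = nn.Linear(app_dim * 3 * 3, sem_dim)
        # Encode raw box coordinates into the same semantic space.
        self.pos_proj = nn.Linear(pos_dim, sem_dim)
        # Step 2: semantic feature interaction via self-attention.
        self.interact = nn.MultiheadAttention(sem_dim, num_heads=4,
                                              batch_first=True)
        # Step 3: semantic-to-positional auxiliary head.
        self.pos_head = nn.Linear(sem_dim, pos_dim)
        self.cls_head = nn.Linear(sem_dim, num_classes)

    def forward(self, feat_map, boxes):
        # feat_map: (B, C, H, W) appearance features from a video backbone.
        # boxes:    (B, N, 4) object boxes, (x1, y1, x2, y2), feature coords.
        B, N, _ = boxes.shape
        batch_idx = torch.arange(B, device=boxes.device, dtype=boxes.dtype)
        rois = torch.cat([batch_idx.repeat_interleave(N).unsqueeze(1),
                          boxes.reshape(B * N, 4)], dim=1)
        # Step 1: positional-to-appearance feature extraction.
        obj_app = roi_align(feat_map, rois, output_size=(3, 3))
        obj_app = self.app_proj(obj_app.flatten(1)).view(B, N, -1)
        # Fuse with projected positional features into semantic tokens.
        tokens = obj_app + self.pos_proj(boxes)
        # Step 2: let object-level semantic features interact.
        fused, _ = self.interact(tokens, tokens, tokens)
        # Step 3: auxiliary prediction of positions from semantics.
        pred_boxes = self.pos_head(fused)          # auxiliary regression target
        logits = self.cls_head(fused.mean(dim=1))  # activity classification
        return logits, pred_boxes
```

In such a setup, a regression loss on pred_boxes (e.g., smooth L1 against ground-truth object boxes) would serve as the auxiliary semantic-to-positional task, added to the standard classification loss so that the positional prediction guides how the features are fused.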
