HAKE: A Knowledge Engine Foundation for Human Activity Understanding

Human activity understanding is of widespread interest in artificial intelligence and spans diverse applications such as health care and behavior analysis. Despite advances driven by deep learning, it remains challenging. Object-recognition-style solutions typically try to map pixels to semantics directly, but activity patterns differ substantially from object patterns, which has prevented comparable success. In this article, we propose a novel paradigm that reformulates the task in two stages: first mapping pixels to an intermediate space spanned by atomic activity primitives, then programming the detected primitives with interpretable logic rules to infer semantics. To afford a representative primitive space, we build a knowledge base containing over 26 million primitive labels and logic rules drawn from human priors or discovered automatically. Our framework, the Human Activity Knowledge Engine (HAKE), exhibits superior generalization ability and performance over canonical methods on challenging benchmarks. Code and data are available at http://hake-mvig.cn/.
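To make the two-stage paradigm concrete, the following is a minimal sketch of pixels-to-primitives detection followed by rule-based inference. It is an illustration under stated assumptions, not HAKE's implementation: all names (PRIMITIVES, RULES, detect_primitives) are hypothetical, the primitive detector is stubbed with fixed scores, and rule evaluation is a crisp conjunction rather than the learned reasoning used in the actual framework.

```python
# Illustrative sketch of the two-stage paradigm: pixels -> primitives -> logic rules -> semantics.
# All names here are hypothetical and not part of the HAKE codebase.

from typing import Dict, List, Set

# Stage 1 output: atomic activity primitives with detection confidences,
# e.g. body-part states such as "hand-hold-something".
PrimitiveScores = Dict[str, float]

# Stage 2 knowledge: interpretable logic rules mapping conjunctions of
# primitives to activity semantics, e.g.
#   hand-hold-something AND hip-sit-on-something -> ride_bicycle
RULES: Dict[str, List[Set[str]]] = {
    "ride_bicycle": [{"hand-hold-something", "hip-sit-on-something"}],
    "drink_with_cup": [{"hand-hold-something", "head-drink-with-something"}],
}


def detect_primitives(image) -> PrimitiveScores:
    """Stage 1 (stub): map pixels to the primitive space.

    A real system would run a trained part-state detector here; fixed
    scores are returned only to keep the sketch self-contained.
    """
    return {
        "hand-hold-something": 0.9,
        "hip-sit-on-something": 0.8,
        "head-drink-with-something": 0.1,
    }


def infer_activities(scores: PrimitiveScores, threshold: float = 0.5) -> List[str]:
    """Stage 2: program detected primitives with logic rules.

    An activity fires if every primitive in at least one of its rules is
    detected above the threshold (a crisp AND for illustration only).
    """
    detected = {p for p, s in scores.items() if s >= threshold}
    return [
        activity
        for activity, clauses in RULES.items()
        if any(clause <= detected for clause in clauses)
    ]


if __name__ == "__main__":
    primitives = detect_primitives(image=None)  # placeholder input
    print(infer_activities(primitives))         # -> ['ride_bicycle']
```

The design point the sketch is meant to convey is the separation of concerns: the perception stage only has to recognize a compact set of reusable primitives, while the reasoning stage composes them with rules that remain interpretable and editable.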
