Context-Associative Hierarchical Memory Model for Human Activity Recognition and Prediction

Human activity recognition is a challenging high-level vision task in which multiple factors, such as the subject, the objects, and their diverse interactions, have to be considered and modeled. Current learning-based methods are limited in their ability to integrate human-level concepts into an easily extensible computational framework. Inspired by existing models of human memory, we present a context-associative approach to recognizing activities that involve human-object interaction. The proposed system recognizes incoming visual content on the basis of previously experienced activities. Each high-level activity is parsed into consecutive subactivities, and we build a context cluster to model the temporal relations among them. The semantic attributes of each subactivity are organized in a concept hierarchy. Based on this hierarchy, a series of similarity functions is defined that turns recognition into retrieval over the contextual memory, similar to the auto-associative character of human memory. Partial matching between the retrieval cue and the stored memory makes activity prediction possible. The dynamic evolution of brain memory is mimicked to allow decay and reinforcement of the input information, providing a natural way to maintain data and save computational time. We evaluate our approach on three data sets: CAD-120, MHOI, and OPPORTUNITY. The proposed method demonstrates promising results compared with other state-of-the-art techniques.
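As a rough illustration of the retrieval idea described above, the following Python sketch stores activities as sequences of subactivities, scores a partially observed sequence against each stored trace, and applies decay and reinforcement to a per-trace strength. It is a minimal sketch under stated assumptions, not the method of the paper: the class names `Trace` and `ContextMemory`, the parameters `decay`, `boost`, and `floor`, the prefix-match score, and the example labels are all hypothetical, and exact string matching stands in for the hierarchy-based similarity functions.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """One remembered activity (hypothetical structure)."""
    label: str               # high-level activity label
    subactivities: tuple     # ordered subactivity names
    strength: float = 1.0    # assumed scalar memory strength


class ContextMemory:
    """Toy memory with prefix retrieval, decay, and reinforcement."""

    def __init__(self, decay=0.9, boost=0.5, floor=0.1):
        # All three parameters are illustrative assumptions.
        self.decay, self.boost, self.floor = decay, boost, floor
        self.traces = []

    def store(self, label, subactivities):
        self.traces.append(Trace(label, tuple(subactivities)))

    def step(self):
        """Decay every trace, then forget traces below the floor."""
        for t in self.traces:
            t.strength *= self.decay
        self.traces = [t for t in self.traces if t.strength >= self.floor]

    def retrieve(self, observed):
        """Return (best trace, score) for a possibly partial observation.

        The score is the fraction of observed subactivities that agree,
        position by position, with the start of a stored trace.
        """
        observed = tuple(observed)
        best, best_score = None, 0.0
        for t in self.traces:
            prefix = t.subactivities[:len(observed)]
            matches = sum(a == b for a, b in zip(observed, prefix))
            score = matches / len(observed) if observed else 0.0
            if score > best_score:
                best, best_score = t, score
        if best is not None:
            best.strength += self.boost  # reinforce the recalled trace
        return best, best_score


# Usage: predict the activity from its first two subactivities.
memory = ContextMemory()
memory.store("making cereal", ["reach", "move", "pour", "place"])
memory.store("taking medicine", ["reach", "open", "eat", "drink"])
best, score = memory.retrieve(["reach", "open"])
remaining = best.subactivities[2:]  # predicted continuation
```

Because the cue is matched only against the beginning of each stored sequence, an unfinished activity can still retrieve a complete trace, and the unmatched tail of that trace serves as the prediction. Traces that are never recalled lose strength at every `step()` and are eventually dropped, which keeps the store small.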
