Learning Hierarchical Models of Complex Daily Activities from Annotated Videos

Effective recognition of complex long-term activities is an increasingly important task in artificial intelligence. In this paper, we propose a novel approach for building models of such activities. First, we automatically learn the hierarchical structure of an activity in a video by inferring the 'parent-child' relations between its components from the variability in annotations provided by multiple annotators. This variability exposes the inherent hierarchical structure of the activity. We then consolidate the hierarchical structures of the same activity from different videos into a unified stochastic grammar describing the overall activity. Finally, we describe an inference mechanism for interpreting new instances of activities. We demonstrate the effectiveness of our system on three datasets of daily activity videos, each annotated by multiple annotators.
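To make the pipeline concrete, the following is a minimal sketch of the two learning steps, not the paper's actual algorithm. It assumes that a 'parent-child' relation can be read off from temporal containment: when one annotator labels a coarse segment and another labels finer segments inside it, the finer segments become children of the coarse one. It further assumes that the stochastic grammar is obtained by counting the resulting production rules across videos and normalizing. The function names, the (label, start, end) tuple format, and the example labels are all hypothetical.

```python
from collections import Counter, defaultdict


def direct_children(annotations):
    """Map each segment to the segments whose tightest enclosing segment it is.

    annotations: iterable of (label, start, end) tuples pooled from all
    annotators of one video (hypothetical format).
    """
    segments = sorted(set(annotations), key=lambda s: (s[1], -s[2]))
    children = defaultdict(list)
    for child in segments:
        # Candidate parents: strictly longer segments that enclose the child.
        containers = [
            p for p in segments
            if p[1] <= child[1] and child[2] <= p[2]
            and (p[2] - p[1]) > (child[2] - child[1])
        ]
        if containers:
            # The shortest enclosing segment is taken as the immediate parent.
            parent = min(containers, key=lambda p: p[2] - p[1])
            children[parent].append(child)
    return children


def rules_from_video(annotations):
    """Turn one video's hierarchy into production rules: parent -> children."""
    rules = []
    for parent, kids in direct_children(annotations).items():
        ordered = sorted(kids, key=lambda s: s[1])  # temporal order
        rules.append((parent[0], tuple(k[0] for k in ordered)))
    return rules


def stochastic_grammar(videos):
    """Estimate rule probabilities by relative frequency across videos."""
    counts = defaultdict(Counter)
    for annotations in videos:
        for lhs, rhs in rules_from_video(annotations):
            counts[lhs][rhs] += 1
    return {
        lhs: {rhs: n / sum(c.values()) for rhs, n in c.items()}
        for lhs, c in counts.items()
    }


# Two toy videos of the same activity, annotated at different granularities.
video_a = [
    ("make tea", 0, 60),
    ("boil water", 0, 30),
    ("fill kettle", 0, 10),
    ("pour water", 30, 45),
    ("add milk", 45, 60),
]
video_b = [
    ("make tea", 0, 50),
    ("boil water", 0, 25),
    ("pour water", 25, 50),
]

grammar = stochastic_grammar([video_a, video_b])
for lhs, expansions in grammar.items():
    for rhs, prob in expansions.items():
        print(f"{lhs} -> {' '.join(repr(r) for r in rhs)}  [{prob:.2f}]")
```

In this toy case "make tea" receives two alternative expansions, each with probability 0.5, because the two videos decompose it differently. A real system would also have to match free-text labels across annotators and tolerate imprecise segment boundaries, both of which this sketch ignores.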
