Efficient extraction of spatial relations for extended objects vis-à-vis human activity recognition in video

Human activity recognition (HAR) deals with recognizing activities or interactions involving humans in a video. Entities occurring in a video frame can be abstracted in a variety of ways, ranging from a detailed silhouette of the entity to a very basic axis-aligned minimum bounding rectangle (MBR). At one end of the spectrum, a detailed silhouette is not only demanding in terms of storage and computational resources but is also highly susceptible to noise. At the other end, MBRs require less storage and computation and abstract away noise and video-specific details. However, for abstracting human bodies in a video, a single MBR is inadequate: in addition to abstracting away noise, it also abstracts away important details such as the posture of the body. For a more precise description that offers a reasonable trade-off between efficiency and noise elimination, a human body can be abstracted as a set of MBRs corresponding to different body parts. However, when activities are represented as relations between interacting objects, a simplistic approximation that treats each MBR as an independent entity leads to the computation of redundant relations. In this paper, we explore a representation schema for interactions between entities that are modeled as sets of rectangles, also referred to as extended objects. We further show that, given this representation schema, a simple recursive algorithm can opportunistically extract topological, directional and distance information in O(n log n) time. We evaluate our representation schema for HAR on the Mind’s Eye dataset (http://www.visint.org), the UT-Interaction dataset (Ryoo and Aggarwal 2010) and the SBU Kinect Interaction dataset (Yun et al. 2012).
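To make the underlying abstraction concrete, the following is a minimal sketch of extracting relations between a single pair of axis-aligned MBRs. The `MBR` type and the coarse relation labels are illustrative assumptions, not the paper's own schema, and the sketch handles one rectangle pair rather than reproducing the paper's O(n log n) recursive algorithm over sets of rectangles (extended objects).

```python
from typing import NamedTuple

class MBR(NamedTuple):
    """Axis-aligned minimum bounding rectangle (illustrative type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

def topological(a: MBR, b: MBR) -> str:
    """Coarse topological relation between two MBRs (RCC-style labels)."""
    if a.xmax < b.xmin or b.xmax < a.xmin or a.ymax < b.ymin or b.ymax < a.ymin:
        return "disjoint"
    if a.xmax == b.xmin or b.xmax == a.xmin or a.ymax == b.ymin or b.ymax == a.ymin:
        return "meet"  # boundaries touch, interiors do not intersect
    if a.xmin <= b.xmin and a.ymin <= b.ymin and a.xmax >= b.xmax and a.ymax >= b.ymax:
        return "contains"
    if b.xmin <= a.xmin and b.ymin <= a.ymin and b.xmax >= a.xmax and b.ymax >= a.ymax:
        return "inside"
    return "overlap"

def distance(a: MBR, b: MBR) -> float:
    """Minimum Euclidean distance between two MBRs; 0 if they intersect."""
    dx = max(b.xmin - a.xmax, a.xmin - b.xmax, 0.0)
    dy = max(b.ymin - a.ymax, a.ymin - b.ymax, 0.0)
    return (dx * dx + dy * dy) ** 0.5
```

Treating each body part as one such MBR, a naive approach would evaluate these pairwise relations for every rectangle pair across two extended objects; the redundancy this creates is what motivates the set-of-rectangles schema described above.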

[1] Stephen Gould et al. Efficient Extraction and Representation of Spatial Information from Video Data, 2013, IJCAI.

[2] Thomas Behr et al. Topological relationships between complex spatial objects, 2006, TODS.

[3] Dimitris Samaras et al. Two-person interaction detection using body-pose features and multiple instance learning, 2012, IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[4] Anthony G. Cohn et al. Towards an Architecture for Cognitive Vision Using Qualitative Spatio-temporal Representations and Abduction, 2003, Spatial Cognition.

[5] Max J. Egenhofer et al. Topological Relations Between Regions with Holes, 1994, Int. J. Geogr. Inf. Sci.

[6] Tao Gao et al. Represent and Infer Human Theory of Mind for Human-Robot Interaction, 2015, AAAI Fall Symposia.

[7] Eliseo Clementini et al. Composite Regions in Topological Queries, 1995, Inf. Syst.

[8] Anthony G. Cohn et al. Qualitative Spatial Representation and Reasoning: An Overview, 2001, Fundam. Informaticae.

[9] Anthony G. Cohn et al. Interleaved Inductive-Abductive Reasoning for Learning Complex Event Models, 2011, ILP.

[10] Zoe Falomir et al. An Ontology for Qualitative Description of Images, 2009.

[11] Spiros Skiadopoulos et al. On the consistency of cardinal direction constraints, 2005, Artif. Intell.

[12] Md. Zia Uddin et al. A Depth Camera-based Human Activity Recognition via Deep Learning Recurrent Neural Network for Health and Social Care Services, 2016, CENTERIS/ProjMAN/HCist.

[13] J. K. Aggarwal et al. Human activity analysis, 2011, ACM Comput. Surv.

[14] Maureen Donnelly et al. A formal theory of qualitative size and distance relations between regions, 2007.

[15] Shyamanta M. Hazarika et al. Comprehensive Representation and Efficient Extraction of Spatial Information for Human Activity Recognition from Video Data, 2016, CVIP.

[16] Jake K. Aggarwal et al. An Overview of Contest on Semantic Description of Human Activities (SDHA) 2010, 2010, ICPR Contests.

[17] Tsuhan Chen et al. Spatio-Temporal Phrases for Activity Recognition, 2012, ECCV.

[18] Anthony G. Cohn et al. Benchmarking qualitative spatial calculi for video activity analysis, 2011.

[19] Lihi Zelnik-Manor et al. Event-based analysis of video, 2001, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Anthony G. Cohn et al. A Spatial Logic based on Regions and Connection, 1992, KR.

[21] Xiaohui Xie et al. Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks, 2016, AAAI.

[22] Anthony G. Cohn et al. Thinking Inside the Box: A Comprehensive Spatial Representation for Video Analysis, 2012, KR.