Egocentric visual scene description based on human-object interaction and deep spatial relations among objects

Visual scene interpretation has been a major area of research in recent years, and recognizing human-object interaction is a fundamental step towards understanding visual scenes. Videos can be described through a variety of human-object interaction scenarios: both human and object are static (static-static), one is static while the other is dynamic (static-dynamic), or both are dynamic (dynamic-dynamic). This paper presents a unified framework for describing these interactions between humans and a variety of objects, with deep learning as its pivot methodology. Human-object interaction is extracted through conventional machine learning techniques, while spatial relations are captured by a model trained with a convolutional neural network. We also address the recognition of human posture in detail to provide an egocentric visual description. After visual features are extracted, sequential minimal optimization is employed to train our model. The extracted interaction, spatial relations, and posture information are fed into a natural language generation module, along with the label of the interacting object, to generate the scene description. The proposed framework is evaluated on two state-of-the-art datasets, MSCOCO and the MSR 3D Daily Activity dataset, achieving 78% and 91.16% accuracy, respectively.
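
As a rough illustration of the pipeline the abstract describes, the minimal sketch below fuses per-frame visual features, trains an SMO-based SVM classifier (scikit-learn's SVC, whose libsvm backend uses an SMO-type solver), and emits a template-based description. The feature extractor, interaction classes, training data, and sentence template are hypothetical stand-ins, not the paper's actual implementation.

```python
"""Hypothetical sketch: fused visual features -> SMO-trained SVM ->
template-based scene sentence. Feature extraction is stubbed with
random vectors; all names and labels are illustrative assumptions."""
import numpy as np
from sklearn.svm import SVC  # libsvm backend solves the SVM dual via an SMO-type algorithm

rng = np.random.default_rng(0)

# Stand-in for the fused feature vector (interaction + CNN spatial
# relation + posture features); a real system would compute this per frame.
def extract_features(frame):
    return rng.normal(size=64)

# Hypothetical interaction classes covering the static/dynamic scenarios.
CLASSES = ["holding", "pushing", "sitting on"]

# Synthetic training data in place of the real annotated feature set.
X_train = rng.normal(size=(300, 64))
y_train = rng.integers(0, len(CLASSES), size=300)

# Train the classifier with sequential minimal optimization.
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)

def describe_scene(frame, object_label):
    """Template-based natural language generation from the predicted
    interaction and the interacting object's label."""
    interaction = CLASSES[clf.predict([extract_features(frame)])[0]]
    return f"A person is {interaction} the {object_label}."

print(describe_scene(frame=None, object_label="chair"))
```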
