Monocular viewpoint-invariant human activity recognition

One of the grand goals of robotics is to have assistive robots living side by side with humans, autonomously assisting them in everyday activities. To interact with humans and assist them, robots must be able to understand and interpret human activities, and interest in the problem of human activity recognition is growing. Despite much progress, most computer vision researchers have narrowed the problem to a fixed camera viewpoint, owing to the inherent difficulty of training their systems across all possible viewpoints. However, since robots and humans are free to move around in the environment, the viewpoint of a robot with respect to a person varies all the time. We therefore attempt to relax the fixed-viewpoint assumption and present a novel and efficient framework to recognize and classify human activities from a monocular video source captured from an arbitrary viewpoint. The proposed framework comprises two stages: human pose recognition and human activity recognition. In the pose recognition stage, an ensemble of pose models performs inference on each video frame; each pose model estimates the probability that the given frame contains the corresponding pose. Over a sequence of frames, each pose model thus produces a time series. In the activity recognition stage, we use a nearest-neighbor classifier, with dynamic time warping as the distance measure, to classify the pose time series. We have built a small-scale proof-of-concept model and performed experiments on three publicly available datasets. The satisfactory results demonstrate the efficacy of our framework and encourage us to develop a full-scale architecture.
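The activity recognition stage described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each activity is represented by a single pose model's probability series (the paper uses an ensemble, i.e. a multivariate series), and the function names `dtw_distance` and `classify_1nn` are hypothetical. Dynamic time warping is computed with the standard dynamic-programming recurrence, and classification is by nearest neighbor under that distance.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D time series,
    via the standard O(n*m) dynamic-programming recurrence."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best alignment extends an insertion, deletion, or match.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def classify_1nn(query, labeled_series):
    """1-nearest-neighbor classification of a pose-probability time
    series against labeled exemplars, using DTW as the distance."""
    best_label, best_dist = None, np.inf
    for label, series in labeled_series:
        d = dtw_distance(query, series)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Toy usage: a query series is matched to the closer exemplar.
exemplars = [("walking", [0.1, 0.8, 0.9]), ("sitting", [0.9, 0.1, 0.1])]
print(classify_1nn([0.1, 0.9, 0.9], exemplars))  # → walking
```

DTW, unlike Euclidean distance, tolerates the temporal stretching and compression that naturally arises when the same activity is performed at different speeds, which is why it is a common choice for this kind of sequence matching.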
