Model-Based Viewpoint Invariant Human Activity Recognition from Uncalibrated Monocular Video Sequence

There is growing interest in human activity recognition systems, motivated by their promising applications in many domains. Despite much progress, most researchers have restricted the problem to a fixed camera viewpoint, owing to the inherent difficulty of training their systems across all possible viewpoints. Fixed-viewpoint systems are impractical in real scenarios. We therefore relax the fixed-viewpoint assumption and present a novel and simple framework that recognizes and classifies human activities from an uncalibrated monocular video source captured from any viewpoint. The proposed framework comprises two stages: 3D human pose estimation and human activity recognition. In the pose estimation stage, we estimate 3D human pose using a simple search-based and tracking-based technique. In the activity recognition stage, we use a Nearest Neighbor classifier, with Dynamic Time Warping as the distance measure, to classify the multivariate time series formed by the streams of pose vectors extracted from multiple video frames. We performed experiments to evaluate the accuracy of the two stages separately. The experimental results are encouraging and demonstrate the effectiveness of our framework.
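
The activity recognition stage can be illustrated with a minimal sketch of Nearest Neighbor classification under a Dynamic Time Warping distance. The abstract does not specify the per-frame distance, any warping-window constraint, or the pose vector layout, so the Euclidean frame distance, the unconstrained warping path, and the function names `frame_distance`, `dtw_distance`, and `classify_1nn` below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def frame_distance(pose_a, pose_b):
    """Euclidean distance between two pose vectors (assumed local cost)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pose_a, pose_b)))


def dtw_distance(seq_a, seq_b):
    """Unconstrained DTW distance between two sequences of pose vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # seq_a advances, seq_b frame repeats
                cost[i][j - 1],      # seq_b advances, seq_a frame repeats
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def classify_1nn(query, training_set):
    """Return the label of the training sequence closest to `query` under DTW.

    `training_set` is a list of (label, sequence) pairs.
    """
    best_label, best_dist = None, float("inf")
    for label, seq in training_set:
        d = dtw_distance(query, seq)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


# Toy, hypothetical data: 2-D "pose vectors" stand in for real 3D joint parameters.
training = [
    ("wave", [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]),
    ("sit", [(5.0, 5.0), (5.0, 5.0), (5.0, 5.0)]),
]
# Same motion as "wave" but with a repeated first frame (a slower start).
query = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
predicted = classify_1nn(query, training)
```

Because DTW allows a frame in one sequence to align with several frames in the other, the slower query still matches the "wave" template exactly, which is why this distance suits activities performed at varying speeds.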
