Joint detection and recognition of human actions in wireless surveillance camera networks

Automatic recognition of human actions in video is a widely studied problem in robotics and computer vision. The majority of recent work in the literature has focused on classifying pre-segmented video clips, and some progress has also been made on joint detection and recognition of actions in complex video sequences. These methods, however, are not designed for wireless camera networks, where the sensors have limited internal processing and communication capabilities. In this paper we present an efficient system for the joint detection and recognition of human actions using a network of wireless smart cameras. Our work builds on Deformable Part Models (DPMs) for detecting objects in static images. We extend this framework to the single-view and multi-view video settings to jointly detect and recognize actions. We call the resulting model the Deformable Keyframe Model (DKM) and tightly integrate it within a centralized video analysis system. In our system, feature extraction is performed locally on board the wireless smart cameras, and classification is performed at a base station with higher processing power. Our analysis demonstrates that this decoupling of the recognition pipeline can significantly reduce the power and bandwidth consumed by the wireless cameras. We experimentally validate our DKMs on two datasets. We first demonstrate the competitiveness of our algorithm by comparing its performance against other state-of-the-art methods on a publicly available dataset. We then extensively validate our system on a novel dataset called the Bosch Multiview Complex Action (BMCA) dataset. Our dataset consists of 11 actions performed continuously by 20 different subjects and captured by cameras located at 4 different vantage points. In our experiments, we demonstrate that the presence of multiple views improves the performance of action detection and recognition by about 15% over the single-view setting.
