High-Speed Action Recognition and Localization in Compressed Domain Videos

We present a compressed domain scheme that is able to recognize and localize actions at high speeds. The recognition problem is posed as performing an action video query on a test video sequence. Our method is based on computing motion similarity using compressed domain features which can be extracted with low complexity. We introduce a novel motion correlation measure that takes into account differences in motion directions and magnitudes. Our method is appearance-invariant, requires no prior segmentation, alignment or stabilization, and is able to localize actions in both space and time. We evaluated our method on a benchmark action video database consisting of six actions performed by 25 people under three different scenarios. Our proposed method achieved a classification accuracy of 90%, comparing favorably with existing methods in action classification accuracy, and is able to localize a template video of 80 x 64 pixels with 23 frames in a test video of 368 x 184 pixels with 835 frames in just 11 s, easily outperforming other methods in localization speed. We also perform a systematic investigation of the effects of various encoding options on our proposed approach. In particular, we present results on the compression-classification tradeoff, which would provide valuable insight into jointly designing a system that performs video encoding at the camera front-end and action classification at the processing back-end.

[1]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[2]  Robert B. Fisher,et al.  Hidden Markov Models for Optical Flow Analysis in Crowds , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[3]  Gary J. Sullivan,et al.  Rate-constrained coder control and comparison of video coding standards , 2003, IEEE Trans. Circuits Syst. Video Technol..

[4]  Ajay Luthra,et al.  Overview of the H.264/AVC video coding standard , 2003, SPIE Optics + Photonics.

[5]  James W. Davis,et al.  The representation and recognition of human movement using temporal templates , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[6]  Shih-Fu Chang,et al.  Compressed-domain techniques for image/video indexing and manipulation , 1995, Proceedings., International Conference on Image Processing.

[7]  Jitendra Malik,et al.  Recognizing action at a distance , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[8]  Ajay Luthra,et al.  Overview of the H.264/AVC video coding standard , 2003, IEEE Trans. Circuits Syst. Video Technol..

[9]  S. Shankar Sastry,et al.  Compressed Domain Real-time Action Recognition , 2006, 2006 IEEE Workshop on Multimedia Signal Processing.

[10]  James W. Davis,et al.  The Representation and Recognition of Action Using Temporal Templates , 1997, CVPR 1997.

[11]  S. Shankar Sastry,et al.  Unsupervised Discovery of Action Hierarchies in Large Collections of Activity Videos , 2007, 2007 IEEE 9th Workshop on Multimedia Signal Processing.

[12]  Bo Shen,et al.  Compressed-Domain Video Processing , 2002 .

[13]  M. Davies,et al.  Approximating optical flow within the MPEG-2 compressed domain , 2005 .

[14]  Thomas Wedi,et al.  Motion- and aliasing-compensated prediction for hybrid video coding , 2003, IEEE Trans. Circuits Syst. Video Technol..

[15]  Faouzi Kossentini,et al.  H.263+: video coding at low bit rates , 1998, IEEE Trans. Circuits Syst. Video Technol..

[16]  R. Venkatesh Babu,et al.  Compressed domain action classification using HMM , 2002, Pattern Recognit. Lett..

[17]  R. L. Baker,et al.  Rate-distortion optimized motion compensation for video compression using fixed or variable size blocks , 1991, IEEE Global Telecommunications Conference GLOBECOM '91: Countdown to the New Millennium. Conference Record.

[18]  Jake K. Aggarwal,et al.  Recognition of Composite Human Activities through Context-Free Grammar Based Representation , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[19]  Jake K. Aggarwal,et al.  Human Motion Analysis: A Review , 1999, Comput. Vis. Image Underst..

[20]  Gary J. Sullivan,et al.  Rate-distortion optimization for video compression , 1998, IEEE Signal Process. Mag..

[21]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[22]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[23]  David J. Fleet,et al.  Performance of optical flow techniques , 1994, International Journal of Computer Vision.

[24]  Wayne H. Wolf,et al.  Human activity detection in MPEG sequences , 2000, Proceedings Workshop on Human Motion.

[25]  Alan N. Willson,et al.  Rate-distortion optimal motion estimation algorithms for motion-compensated transform video coding , 1998, IEEE Trans. Circuits Syst. Video Technol..

[26]  Martial Hebert,et al.  Efficient visual event detection using volumetric features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[27]  Didier Le Gall,et al.  MPEG: a video compression standard for multimedia applications , 1991, CACM.

[28]  Eli Shechtman,et al.  Space-time behavior based correlation , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[29]  Mubarak Shah,et al.  Recognizing human actions in videos acquired by uncalibrated moving cameras , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.