Hollywood 3D: What are the Best 3D Features for Action Recognition?

Action recognition “in the wild” is extremely challenging, particularly when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent growth of 3D data in broadcast content and commercial depth sensors, makes it possible to overcome this. However, there is little work examining the best way to exploit this new modality. In this paper we introduce the Hollywood 3D benchmark, which is the first dataset containing “in the wild” action footage including 3D data. This dataset consists of 650 stereo video clips across 14 action classes, taken from Hollywood movies. We provide stereo calibrations and depth reconstructions for each clip. We also provide an action recognition pipeline, and propose a number of specialised depth-aware techniques including five interest point detectors and three feature descriptors. Extensive tests allow evaluation of different appearance and depth encoding schemes. Our novel techniques exploiting this depth allow us to reach performance levels more than triple those of the best baseline algorithm using only appearance information. The benchmark data, code and calibrations are all made available to the community.

[1]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Andrew Zisserman,et al.  Action Recognition From Weak Alignment of Body Parts , 2014, BMVC.

[3]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[4]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[5]  Philip H. S. Torr,et al.  Learning discriminative space-time actions from weakly labelled videos , 2012, BMVC.

[6]  Christopher G. Harris,et al.  A Combined Corner and Edge Detector , 1988, Alvey Vision Conference.

[7]  Cordelia Schmid,et al.  Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[9]  Alexandros Iosifidis,et al.  Multi-view Regularized Extreme Learning Machine for Human Action Recognition , 2014, SETN.

[10]  Richard Bowden,et al.  Natural Action Recognition Using Invariant 3D Motion Encoding , 2014, ECCV.

[11]  Alexandros Iosifidis,et al.  Human action recognition in stereoscopic videos based on bag of features and disparity pyramids , 2014, 2014 22nd European Signal Processing Conference (EUSIPCO).

[12]  Jiri Matas,et al.  Fixing the Locally Optimized RANSAC , 2012, BMVC.

[13]  Andrew Zisserman,et al.  Multiple view geometry in computer visiond , 2001 .

[14]  Christopher Joseph Pal,et al.  Activity recognition using the velocity histories of tracked keypoints , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[15]  Andrew Gilbert,et al.  Data Mining for Action Recognition , 2014, ACCV.

[16]  Richard Bowden,et al.  Kinecting the dots: Particle based scene flow from depth sensors , 2011, 2011 International Conference on Computer Vision.

[17]  E. Saff,et al.  Distributing many points on a sphere , 1997 .

[18]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[19]  Richard Szeliski,et al.  High-accuracy stereo depth maps using structured light , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[20]  Neil A. Dodgson,et al.  Real-Time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid , 2010, ECCV.

[21]  Richard Bowden,et al.  Hollywood 3 D : Recognizing Actions in 3 D Natural Scenes , 2013 .

[22]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[23]  Zicheng Liu,et al.  HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Andrew Gilbert,et al.  Action Recognition Using Mined Hierarchical Compound Features , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Andrew Gilbert,et al.  Capturing the relative distribution of features for action recognition , 2011, Face and Gesture 2011.

[27]  Zuzana Kukelova,et al.  Polynomial Eigenvalue Solutions to the 5-pt and 6-pt Relative Pose Problems , 2008, BMVC.

[28]  Mario Fernando Montenegro Campos,et al.  STOP: Space-Time Occupancy Patterns for 3D Action Recognition from Depth Map Sequences , 2012, CIARP.

[29]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[30]  Wanqing Li,et al.  Action recognition based on a bag of 3D points , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[31]  Alex Pentland,et al.  Coupled hidden Markov models for complex action recognition , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[32]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[33]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[34]  Richard Bowden,et al.  Scene Particles: Unregularized Particle-Based Scene Flow Estimation , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Paul Beaudet,et al.  Rotationally invariant image operators , 1978 .

[36]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Xiaodong Yang,et al.  Recognizing actions using depth motion maps-based histograms of oriented gradients , 2012, ACM Multimedia.

[38]  Patrick Pérez,et al.  Retrieving actions in movies , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[39]  Alexandros Iosifidis,et al.  Stereoscopic video description for human action recognition , 2014, 2014 IEEE Symposium on Computational Intelligence for Multimedia, Signal and Vision Processing (CIMSIVP).

[40]  Richard Bowden,et al.  Scene Flow Estimation using Intelligent Cost Functions , 2014, BMVC.

[41]  Roland Memisevic,et al.  Learning to combine depth and motion , 2013 .

[42]  Jitendra Malik,et al.  Shape matching and object recognition using shape contexts , 2010, 2010 3rd International Conference on Computer Science and Information Technology.

[43]  Andrew Zisserman,et al.  Robust computation and parametrization of multiple view relations , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[44]  Frederic Devernay,et al.  A Variational Method for Scene Flow Estimation from Stereo Sequences , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[45]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[47]  Qi Tian,et al.  Human Daily Action Analysis with Multi-view and Color-Depth Data , 2012, ECCV Workshops.

[48]  Fei-Fei Li,et al.  Action Recognition with Exemplar Based 2.5D Graph Matching , 2012, ECCV.

[49]  Deva Ramanan,et al.  Detecting Actions, Poses, and Objects with Relational Phraselets , 2012, ECCV.

[50]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[51]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[52]  Ronen Basri,et al.  Actions as space-time shapes , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[53]  Richard Bowden,et al.  Hollywood 3D: Recognizing Actions in 3D Natural Scenes , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Matti Pietikäinen,et al.  Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns , 2002, IEEE Trans. Pattern Anal. Mach. Intell..