Hallucinating IDT Descriptors and I3D Optical Flow Features for Action Recognition With CNNs

In this paper, we revive the use of old-fashioned handcrafted video representations for action recognition and put new life into these techniques via a CNN-based hallucination step. Despite of the use of RGB and optical flow frames, the I3D model (amongst others) thrives on combining its output with the Improved Dense Trajectory (IDT) and extracted with its low-level video descriptors encoded via Bag-of-Words (BoW) and Fisher Vectors (FV). Such a fusion of CNNs and handcrafted representations is time-consuming due to pre-processing, descriptor extraction, encoding and tuning parameters. Thus, we propose an end-to-end trainable network with streams which learn the IDT-based BoW/FV representations at the training stage and are simple to integrate with the I3D model. Specifically, each stream takes I3D feature maps ahead of the last 1D conv. layer and learns to `translate' these maps to BoW/FV representations. Thus, our model can hallucinate and use such synthesized BoW/FV representations at the testing stage. We show that even features of the entire I3D optical flow stream can be hallucinated thus simplifying the pipeline. Our model saves 20-55h of computations and yields state-of-the-art results on four publicly available datasets.

[1]  Anoop Cherian,et al.  Tensor Representations via Kernel Linearization for Action Recognition from 3D Skeletons , 2016, ECCV.

[2]  Anoop Cherian,et al.  Higher-Order Pooling of CNN Features via Kernel Linearization for Action Recognition , 2017, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

[3]  Richard P. Wildes,et al.  Spatiotemporal Multiplier Networks for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  C. Schmid,et al.  On the burstiness of visual elements , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Piotr Koniusz,et al.  CNN-based Action Recognition and Supervised Domain Adaptation on 3D Body Skeletons via Kernel Feature Maps , 2018, BMVC.

[6]  Lars Petersson,et al.  Bilinear Attention Networks for Person Retrieval , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[7]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[8]  Krystian Mikolajczyk,et al.  Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection , 2013, Comput. Vis. Image Underst..

[9]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[10]  Anoop Cherian,et al.  Generalized Rank Pooling for Activity Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[12]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[13]  Rasmus Pagh,et al.  Fast and scalable polynomial kernels via explicit feature maps , 2013, KDD.

[14]  Cordelia Schmid,et al.  DeepFlow: Large Displacement Optical Flow with Deep Matching , 2013, 2013 IEEE International Conference on Computer Vision.

[15]  Mehrtash Harandi,et al.  Dimensionality Reduction on SPD Manifolds: The Emergence of Geometry-Aware Methods , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[17]  Bernt Schiele,et al.  A database for fine grained activity detection of cooking activities , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  William T. Freeman,et al.  Orientation Histograms for Hand Gesture Recognition , 1995 .

[19]  Richard Nock,et al.  Min-Max Statistical Alignment for Transfer Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Kilian Q. Weinberger,et al.  Feature hashing for large scale multitask learning , 2009, ICML '09.

[21]  Hui Wang,et al.  Human Action Recognition Using Multi-Velocity STIPs and Motion Energy Orientation Histogram , 2014, J. Inf. Sci. Eng..

[22]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[23]  Moses Charikar,et al.  Finding frequent items in data streams , 2004, Theor. Comput. Sci..

[24]  Kaiming He,et al.  Long-Term Feature Banks for Detailed Video Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[26]  Lei Wang,et al.  A Comparative Review of Recent Kinect-Based Action Recognition Algorithms , 2019, IEEE Transactions on Image Processing.

[27]  K. Mikolajczyk,et al.  Higher-order Occurrence Pooling on Mid- and Low-level Features: Visual Concept Detection , 2013 .

[28]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Romain Dupont,et al.  A General Dense Image Matching Framework Combining Direct and Feature-Based Costs , 2013, 2013 IEEE International Conference on Computer Vision.

[30]  Jitendra Malik,et al.  Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Rui Zhang,et al.  Museum Exhibit Identification Challenge for the Supervised Domain Adaptation and Beyond , 2018, ECCV.

[32]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Tinne Tuytelaars,et al.  Modeling video evolution for action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Sebastian Thrun,et al.  Is Learning The n-th Thing Any Easier Than Learning The First? , 1995, NIPS.

[35]  Fatih Murat Porikli,et al.  Domain Adaptation by Mixture of Alignments of Second-or Higher-Order Scatter Tensors , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Anoop Cherian,et al.  Non-linear Temporal Subspace Representations for Activity Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Lei Wang,et al.  Loss Switching Fusion with Similarity Search for Video Classification , 2019, 2019 IEEE International Conference on Image Processing (ICIP).

[38]  Richard Nock,et al.  Siamese Networks: The Tale of Two Manifolds , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[39]  Basura Fernando,et al.  Learning End-to-end Video Classification with Rank-Pooling , 2016, ICML.

[40]  Rich Caruana,et al.  Multitask Learning , 1997, Machine-mediated learning.

[41]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[42]  Anoop Cherian,et al.  Learning Discriminative Video Representations Using Adversarial Perturbations , 2018, ECCV.

[43]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[44]  Fatih Murat Porikli,et al.  A Deeper Look at Power Normalizations , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[46]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[47]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[48]  Krystian Mikolajczyk,et al.  Higher-Order Occurrence Pooling for Bags-of-Words: Visual Concept Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[49]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[51]  John Radford,et al.  Thinking in perspective: Critical essays in the study of thought processes , 1978 .

[52]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[53]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[54]  Krystian Mikolajczyk,et al.  Soft assignment of visual words as Linear Coordinate Coding and optimisation of its reconstruction error , 2011, 2011 18th IEEE International Conference on Image Processing.

[55]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[56]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[57]  Cordelia Schmid,et al.  PoTion: Pose MoTion Representation for Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58]  Nicu Sebe,et al.  Realtime Video Classification using Dense HOF/HOG , 2014, ICMR.

[59]  Thomas Brox,et al.  Universität Des Saarlandes Fachrichtung 6.1 – Mathematik Highly Accurate Optic Flow Computation with Theoretically Justified Warping Highly Accurate Optic Flow Computation with Theoretically Justified Warping , 2022 .

[60]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[61]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[62]  Thomas B. Moeslund,et al.  Selective spatio-temporal interest points , 2012, Comput. Vis. Image Underst..

[63]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[64]  Cordelia Schmid,et al.  EpicFlow: Edge-preserving interpolation of correspondences for optical flow , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Lior Wolf,et al.  Local Trinary Patterns for human action recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[66]  Joachim Weickert,et al.  Reliable Estimation of Dense Optical Flow Fields with Large Displacements , 2000, International Journal of Computer Vision.

[67]  Berthold K. P. Horn,et al.  Determining Optical Flow , 1981, Other Conferences.

[68]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[69]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[70]  Tobias Höllerer,et al.  Evaluation of Interest Point Detectors and Feature Descriptors for Visual Tracking , 2011, International Journal of Computer Vision.

[71]  Richard P. Wildes,et al.  Temporal Residual Networks for Dynamic Scene Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).