Compact and Low-Complexity Binary Feature Descriptor and Fisher Vectors for Video Analytics

In this paper, we propose a compact and low-complexity binary feature descriptor for video analytics. Our binary descriptor encodes the motion information of a spatio-temporal support region into a low-dimensional binary string. The descriptor is based on a binning strategy and a construction that binarizes separately the horizontal and vertical motion components of the spatio-temporal support region. We pair our descriptor with a novel Fisher Vector (FV) scheme for binary data to project a set of binary features into a fixed length vector in order to evaluate the similarity between feature sets. We test the effectiveness of our binary feature descriptor with FVs for action recognition, which is one of the most challenging tasks in computer vision, as well as gait recognition and animal behavior clustering. Several experiments on the KTH, UCF50, UCF101, CASIA-B, and TIGdog datasets show that the proposed binary feature descriptor outperforms the state-of-the-art feature descriptors in terms of computational time and memory and storage requirements. When paired with FVs, the proposed feature descriptor attains a very competitive performance, outperforming several state-of-the-art feature descriptors and some methods based on convolutional neural networks.

[1]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Paul L. Rosin Measuring Corner Properties , 1999, Comput. Vis. Image Underst..

[3]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[4]  Christopher Joseph Pal,et al.  Activity recognition using the velocity histories of tracked keypoints , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[5]  Rahul Sukthankar,et al.  Articulated motion discovery using pairs of trajectories , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[7]  Svetlana Lazebnik,et al.  Iterative quantization: A procrustean approach to learning binary codes , 2011, CVPR 2011.

[8]  Hyunwoo Nam,et al.  Content adaptive video summarization using spatio-temporal features , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[9]  Yunqian Ma,et al.  Event detection using local binary pattern based dynamic textures , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[10]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[11]  Cordelia Schmid,et al.  PoTion: Pose MoTion Representation for Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Mubarak Shah,et al.  Classifying web videos using a global video descriptor , 2013, Machine Vision and Applications.

[13]  Kristen Grauman,et al.  Kernelized Locality-Sensitive Hashing , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Limin Wang,et al.  Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice , 2014, Comput. Vis. Image Underst..

[15]  Jie Lin,et al.  A practical guide to CNNs and Fisher Vectors for image instance retrieval , 2015, Signal Process..

[16]  Sajib Saha,et al.  ALOHA: An efficient binary descriptor based on Haar features , 2012, 2012 19th IEEE International Conference on Image Processing.

[17]  Thomas B. Moeslund,et al.  Selective spatio-temporal interest points , 2012, Comput. Vis. Image Underst..

[18]  Jintao Li,et al.  Hierarchical spatio-temporal context modeling for action recognition , 2009, CVPR.

[19]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Ling Shao,et al.  A Wavelet Based Local Descriptor for Human Action Recognition , 2010, BMVC.

[21]  Chang-Tsun Li,et al.  Video Anomaly Detection With Compact Feature Sets for Online Performance , 2017, IEEE Transactions on Image Processing.

[22]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[23]  Song Han,et al.  Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding , 2015, ICLR.

[24]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[25]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Guillaume-Alexandre Bilodeau,et al.  Efficient Action Recognition with MoFREAK , 2013, 2013 International Conference on Computer and Robot Vision.

[27]  Mubarak Shah,et al.  Recognizing 50 human action categories of web videos , 2012, Machine Vision and Applications.

[28]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[29]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[30]  Roland Siegwart,et al.  BRISK: Binary Robust invariant scalable keypoints , 2011, 2011 International Conference on Computer Vision.

[31]  Xiantong Zhen,et al.  Deep Ensemble Machine for Video Classification , 2019, IEEE Transactions on Neural Networks and Learning Systems.

[32]  Matti Pietikäinen,et al.  Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[33]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[34]  Lin Li,et al.  End-to-end Video-level Representation Learning for Action Recognition , 2017, 2018 24th International Conference on Pattern Recognition (ICPR).

[35]  Tal Hassner,et al.  Motion Interchange Patterns for Action Recognition in Unconstrained Videos , 2012, ECCV.

[36]  Cordelia Schmid,et al.  Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Sergio A. Velastin,et al.  How close are we to solving the problem of automated visual surveillance? , 2008, Machine Vision and Applications.

[38]  Gary R. Bradski,et al.  ORB: An efficient alternative to SIFT or SURF , 2011, 2011 International Conference on Computer Vision.

[39]  Cordelia Schmid,et al.  Evaluation of GIST descriptors for web-scale image search , 2009, CIVR '09.

[40]  Florent Perronnin,et al.  Large-scale image retrieval with compressed Fisher vectors , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[41]  Alfons Juan-Císcar,et al.  Bernoulli mixture models for binary images , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[42]  Lior Wolf,et al.  Local Trinary Patterns for human action recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[43]  Tieniu Tan,et al.  Robust view transformation model for gait recognition , 2011, 2011 18th IEEE International Conference on Image Processing.

[44]  Vincent Lepetit,et al.  DAISY: An Efficient Dense Descriptor Applied to Wide-Baseline Stereo , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Wen Gao,et al.  Multiscale Deep Alternative Neural Network for Large-Scale Video Classification , 2018, IEEE Transactions on Multimedia.

[46]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[48]  Ivan Laptev,et al.  Efficient Feature Extraction, Encoding, and Classification for Action Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Yee-Hong Yang,et al.  Robust Volumetric Texture Classification of Magnetic Resonance Images of the Brain Using Local Frequency Descriptor , 2014, IEEE Transactions on Image Processing.

[50]  Yu Qiao,et al.  Action Recognition with Stacked Fisher Vectors , 2014, ECCV.

[51]  IV. F ISHER,et al.  Image Retrieval with Fisher Vectors of Binary Features , 2013 .

[52]  Kristen Grauman,et al.  Kernelized locality-sensitive hashing for scalable image search , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[53]  David Haussler,et al.  Exploiting Generative Models in Discriminative Classifiers , 1998, NIPS.

[54]  Bowen Zhang,et al.  Real-Time Action Recognition with Enhanced Motion Vector CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Rastislav J. R. Struharik,et al.  CoNNA – Compressed CNN Hardware Accelerator , 2018, 2018 21st Euromicro Conference on Digital System Design (DSD).

[56]  Pierre Vandergheynst,et al.  FREAK: Fast Retina Keypoint , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[57]  Martial Hebert,et al.  Trajectons: Action recognition through the motion analysis of tracked features , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[58]  Loong Fah Cheong,et al.  Activity recognition using dense long-duration trajectories , 2010, 2010 IEEE International Conference on Multimedia and Expo.

[59]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[60]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[61]  Svetlana Lazebnik,et al.  Locality-sensitive binary codes from shift-invariant kernels , 2009, NIPS.

[62]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[63]  Bhiksha Raj,et al.  Beyond Gaussian Pyramid: Multi-skip Feature Stacking for action recognition , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Ling Shao,et al.  Transform based spatio-temporal descriptors for human action recognition , 2011, Neurocomputing.

[65]  Chang-Tsun Li,et al.  A fast binary pair-based video descriptor for action recognition , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[66]  Yasushi Makihara,et al.  Cross-view gait recognition using view-dependent discriminative analysis , 2014, IEEE International Joint Conference on Biometrics.

[67]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[68]  P. L. Venetianer,et al.  The evolution of video surveillance: an overview , 2008, Machine Vision and Applications.

[69]  Jian Cheng,et al.  Quantized CNN: A Unified Approach to Accelerate and Compress Convolutional Networks , 2018, IEEE Transactions on Neural Networks and Learning Systems.

[70]  Yap-Peng Tan,et al.  View invariant gait recognition , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[71]  Kejun Wang,et al.  Video-Based Abnormal Human Behavior Recognition—A Review , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[72]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[73]  Junsong Yuan,et al.  Abnormal event detection in crowded scenes using sparse representation , 2013, Pattern Recognit..

[74]  Wei Zhang,et al.  Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[75]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[76]  Cisco Visual Networking Index: Forecast and Methodology 2016-2021.(2017) http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual- networking-index-vni/complete-white-paper-c11-481360.html. High Efficiency Video Coding (HEVC) Algorithms and Architectures https://jvet.hhi.fraunhofer. , 2017 .

[77]  Roberto Leyva,et al.  Fast Binary-Based Video Descriptors for Action Recognition , 2016, 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[78]  Florent Perronnin,et al.  Fisher vectors meet Neural Networks: A hybrid classification architecture , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[79]  Yasushi Makihara,et al.  Gait Recognition Using a View Transformation Model in the Frequency Domain , 2006, ECCV.

[80]  Vincent Lepetit,et al.  BRIEF: Binary Robust Independent Elementary Features , 2010, ECCV.

[81]  Alexandros Iosifidis,et al.  On the kernel Extreme Learning Machine speedup , 2015, Pattern Recognit. Lett..