GPU-FV: Realtime Fisher Vector and Its Applications in Video Monitoring

Fisher vector has been widely used in many multimedia retrieval and visual recognition applications with good performance. However, the computation complexity prevents its usage in real-time video monitoring. In this work, we proposed and implemented GPU-FV, a fast Fisher vector extraction method with the help of modern GPUs. The challenge of implementing Fisher vector on GPUs lies in the data dependency in feature extraction and expensive memory access in Fisher vector computing. To handle these challenges, we carefully designed GPU-FV in a way that utilizes the computing power of GPU as much as possible, and applied optimizations such as loop tiling to boost the performance. GPU-FV is about 12 times faster than the CPU version, and 50\% faster than a non-optimized GPU implementation. For standard video input (320*240), GPU-FV can process each frame within 34ms on a model GPU. Our experiments show that GPU-FV obtains a similar recognition accuracy as traditional FV on VOC 2007 and Caltech 256 image sets. We also applied GPU-FV for realtime video monitoring tasks and found that GPU-FV outperforms a number of previous works. Especially, when the number of training examples are small, GPU-FV outperforms the recent popular deep CNN features borrowed from ImageNet.

[1]  Ramakant Nevatia,et al.  ISOMER: Informative Segment Observations for Multimedia Event Recounting , 2014, ICMR.

[2]  Andrea Vedaldi,et al.  Vlfeat: an open and portable library of computer vision algorithms , 2010, ACM Multimedia.

[3]  Venkatesh Saligrama,et al.  Video anomaly detection based on local statistical aggregates , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Sharath Pankanti,et al.  Spatio-temporal fisher vector coding for surveillance event detection , 2013, ACM Multimedia.

[5]  Ankur Datta,et al.  Efficient Maximum Appearance Search for Large-Scale Object Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Koen E. A. van de Sande,et al.  Fisher and VLAD with FLAIR , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[8]  Jan-Michael Frahm,et al.  Feature tracking and matching in video using programmable graphics hardware , 2007, Machine Vision and Applications.

[9]  A. Cheriyadat,et al.  GPU Accelerated Textons and Dense SIFT Features for Human Settlement Detection from High-Resolution Satellite Imagery , 2015 .

[10]  L. Kratz,et al.  Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Fei-Fei Li,et al.  Online detection of unusual events in videos via dynamic sparse coding , 2011, CVPR 2011.

[12]  Mubarak Shah,et al.  Abnormal crowd behavior detection using social force model , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[14]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[15]  Shengen Yan,et al.  StreamScan: fast scan algorithms for GPUs without global barrier synchronization , 2013, PPoPP '13.

[16]  Junsong Yuan,et al.  Sparse reconstruction cost for abnormal event detection , 2011, CVPR 2011.

[17]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[18]  Arnold W. M. Smeulders,et al.  Fine-Grained Categorization by Alignments , 2013, 2013 IEEE International Conference on Computer Vision.

[19]  Luc Van Gool,et al.  Speeded-Up Robust Features (SURF) , 2008, Comput. Vis. Image Underst..

[20]  Andrew Zisserman,et al.  Fisher Vector Faces in the Wild , 2013, BMVC.

[21]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[22]  Christopher Hunt,et al.  Notes on the OpenSURF Library , 2009 .

[23]  Shengen Yan,et al.  CLSIFT: An Optimization Study of the Scale Invariance Feature Transform on GPUs , 2013, 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing.

[24]  Bálint Daróczy,et al.  GMM based Fisher vector calculation on GPGPU , 2011 .

[25]  Luc Van Gool,et al.  Fast scale invariant feature detection and matching on programmable graphics hardware , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[26]  Iasonas Kokkinos,et al.  Describing Textures in the Wild , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[28]  Thomas Mensink,et al.  The Rijksmuseum Challenge: Museum-Centered Visual Recognition , 2014, ICMR.

[29]  Naila Murray,et al.  Revisiting the Fisher vector for fine-grained classification , 2014, Pattern Recognit. Lett..

[30]  Nicu Sebe,et al.  Fisher kernel based relevance feedback for multimodal video retrieval , 2013, ICMR '13.

[31]  Cordelia Schmid,et al.  Combining attributes and Fisher vectors for efficient image retrieval , 2011, CVPR 2011.

[32]  Koen E. A. van de Sande,et al.  Empowering Visual Categorization With the GPU , 2011, IEEE Transactions on Multimedia.

[33]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[34]  Florent Perronnin,et al.  Large-scale image retrieval with compressed Fisher vectors , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[35]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[36]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[37]  Kurt Keutzer,et al.  Fast support vector machine training and classification on graphics processors , 2008, ICML '08.

[38]  Cewu Lu,et al.  Abnormal Event Detection at 150 FPS in MATLAB , 2013, 2013 IEEE International Conference on Computer Vision.

[39]  Nicu Sebe,et al.  Fisher Kernel Temporal Variation-based Relevance Feedback for video retrieval , 2016, Comput. Vis. Image Underst..