A Fisher Kernel Approach for Multiple Instance Based Object Retrieval in Video Surveillance

This paper presents an automated surveillance system that exploits the Fisher Kernel representation for the task of multiple-instance object retrieval. The main purpose of the proposed algorithm is to track a list of persons across several video sources using only a few training examples. The Fisher Kernel representation describes a set of features as the gradient of the log-likelihood, taken with respect to the parameters of the generative probability distribution that models the features. We learn this generative probability distribution over all features extracted from a reduced set of relevant frames. The proposed approach yields significant improvements, and we demonstrate that Fisher kernels are well suited to this task. We show that the approach generalizes across features by conducting an extensive evaluation with a broad range of keypoint features. We also evaluate our method on two standard video surveillance datasets, attaining superior results compared to state-of-the-art object recognition algorithms.
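
As an illustration of the representation described above, the following is a minimal sketch of a Fisher score, not the authors' implementation. It assumes the generative model is a Gaussian mixture with diagonal covariances (the abstract does not name the model), computes the gradient of the average log-likelihood with respect to the component means only, and omits the Fisher information normalization. All function and parameter names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Soft assignment of x to each mixture component (sums to 1)."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def fisher_score_means(features, weights, means, variances):
    """Gradient of the average log-likelihood w.r.t. the component means.

    Assumption: diagonal-covariance GMM; no Fisher information scaling.
    Returns one flat vector of length (components * dimensions).
    """
    n = len(features)
    dim = len(means[0])
    score = [[0.0] * dim for _ in means]
    for x in features:
        gamma = posteriors(x, weights, means, variances)
        for k, (m, v) in enumerate(zip(means, variances)):
            for d in range(dim):
                score[k][d] += gamma[k] * (x[d] - m[d]) / v[d]
    return [s / n for row in score for s in row]
```

The resulting vector has a fixed length regardless of how many features a frame or track contributes, which is what allows a variable-size set of keypoint descriptors to be passed to a standard classifier. In this sketch the score vanishes when the model mean already matches the sample mean of the features.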
