Real-Time Visual Concept Classification

As datasets grow increasingly large in content-based image and video retrieval, computational efficiency of concept classification is important. This paper reviews techniques to accelerate concept classification, where we show the trade-off between computational efficiency and accuracy. As a basis, we use the Bag-of-Words algorithm that in the 2008 benchmarks of TRECVID and PASCAL lead to the best performance scores. We divide the evaluation in three steps: 1) Descriptor Extraction, where we evaluate SIFT, SURF, DAISY, and Semantic Textons. 2) Visual Word Assignment, where we compare a k-means visual vocabulary with a Random Forest and evaluate subsampling, dimension reduction with PCA, and division strategies of the Spatial Pyramid. 3) Classification, where we evaluate the χ2, RBF, and Fast Histogram Intersection kernel for the SVM. Apart from the evaluation, we accelerate the calculation of densely sampled SIFT and SURF, accelerate nearest neighbor assignment, and improve accuracy of the Histogram Intersection kernel. We conclude by discussing whether further acceleration of the Bag-of-Words pipeline is possible. Our results lead to a 7-fold speed increase without accuracy loss, and a 70-fold speed increase with 3% accuracy loss. The latter system does classification in real-time, which opens up new applications for automatic concept classification. For example, this system permits five standard desktop PCs to automatically tag for 20 classes all images that are currently uploaded to Flickr.

[1]  Joost van de Weijer,et al.  Fast Anisotropic Gauss Filtering , 2002, ECCV.

[2]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[3]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[4]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[5]  Leonidas J. Guibas,et al.  The Earth Mover's Distance as a Metric for Image Retrieval , 2000, International Journal of Computer Vision.

[6]  Lixin Fan,et al.  Categorizing Nine Visual Classes using Local Appearance Descriptors , 2004 .

[7]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[8]  Frédéric Jurie,et al.  Creating efficient codebooks for visual recognition , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[9]  Bernt Schiele,et al.  Local features for object class recognition , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[10]  Cordelia Schmid,et al.  A Performance Evaluation of Local Descriptors , 2005, IEEE Trans. Pattern Anal. Mach. Intell..

[11]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[12]  Pierre Geurts,et al.  Extremely randomized trees , 2006, Machine Learning.

[13]  Paul Over,et al.  Evaluation campaigns and TRECVid , 2006, MIR '06.

[14]  Marcel Worring,et al.  The challenge problem for automated detection of 101 semantic concepts in multimedia , 2006, MM '06.

[15]  Frédéric Jurie,et al.  Sampling Strategies for Bag-of-Features Image Classification , 2006, ECCV.

[16]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[17]  David Nistér,et al.  Scalable Recognition with a Vocabulary Tree , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[18]  Frédéric Jurie,et al.  Fast Discriminative Visual Codebooks using Randomized Clustering Forests , 2006, NIPS.

[19]  Dennis Koelma,et al.  The MediaMill TRECVID 2008 Semantic Video Search Engine , 2008, TRECVID.

[20]  Horst Bischof,et al.  Fast Approximated SIFT , 2006, ACCV.

[21]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[22]  Jiri Matas,et al.  Learning a Fast Emulator of a Binary Decision Process , 2007, ACCV.

[23]  Jiri Matas,et al.  Improving Descriptors for Fast Tree Matching by Optimal Linear Projection , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[24]  Cordelia Schmid,et al.  Vector Quantizing Feature Space with a Regular Lattice , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[25]  Chong-Wah Ngo,et al.  Towards optimal bag-of-features for object categorization and semantic video retrieval , 2007, CIVR '07.

[26]  nVIDIA社 CUDA Programming Guide 1.1 , 2007 .

[27]  Cordelia Schmid,et al.  Learning Object Representations for Visual Object Class Recognition , 2007, ICCV 2007.

[28]  Kurt Keutzer,et al.  Fast support vector machine training and classification on graphics processors , 2008, ICML '08.

[29]  Frédéric Jurie,et al.  Randomized Clustering Forests for Image Classification , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Koen E. A. van de Sande,et al.  A comparison of color features for visual concept classification , 2008, CIVR '08.

[31]  Vincent Lepetit,et al.  A fast local descriptor for dense matching , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Toby Sharp,et al.  Implementing Decision Trees and Forests on a GPU , 2008, ECCV.

[33]  Lei Wang,et al.  A Fast Algorithm for Creating a Compact and Discriminative Visual Codebook , 2008, ECCV.

[34]  Subhransu Maji,et al.  Classification using intersection kernel support vector machines is efficient , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Michael Isard,et al.  Lost in quantization: Improving particular object retrieval in large scale image databases , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Luc Van Gool,et al.  Speeded-Up Robust Features (SURF) , 2008, Comput. Vis. Image Underst..

[37]  Roberto Cipolla,et al.  Semantic texton forests for image categorization and segmentation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Shih-Fu Chang,et al.  Fast kernel learning for spatial pyramid matching , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Alberto Del Bimbo,et al.  Video Event Classification Using Bag of Words and String Kernels , 2009, ICIAP.

[40]  Marcel Worring,et al.  Concept-Based Video Retrieval , 2009, Found. Trends Inf. Retr..

[41]  Arnold W. M. Smeulders,et al.  Real-time bag of words, approximately , 2009, CIVR '09.

[42]  Cor J. Veenman,et al.  Visual Word Ambiguity , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[44]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .