Empowering Visual Categorization With the GPU

Visual categorization is important to manage large collections of digital images and video, where textual metadata is often incomplete or simply unavailable. The bag-of-words model has become the most powerful method for visual categorization of images and video. Despite its high accuracy, a severe drawback of this model is its high computational cost. As the trend to increase computational power in newer CPU and GPU architectures is to increase their level of parallelism, exploiting this parallelism becomes an important direction to handle the computational cost of the bag-of-words approach. When optimizing a system based on the bag-of-words approach, the goal is to minimize the time it takes to process batches of images. this paper, we analyze the bag-of-words model for visual categorization in terms of computational cost and identify two major bottlenecks: the quantization step and the classification step. We address these two bottlenecks by proposing two efficient algorithms for quantization and classification by exploiting the GPU hardware and the CUDA parallel programming model. The algorithms are designed to (1) keep categorization accuracy intact, (2) decompose the problem, and (3) give the same numerical results. In the experiments on large scale datasets, it is shown that, by using a parallel implementation on the Geforce GTX260 GPU, classifying unseen images is 4.8 times faster than a quad-core CPU version on the Core i7 920, while giving the exact same numerical results. In addition, we show how the algorithms can be generalized to other applications, such as text retrieval and video retrieval. Moreover, when the obtained speedup is used to process extra video frames in a video retrieval benchmark, the accuracy of visual categorization is improved by 29%.

[1]  William Kahan,et al.  Pracniques: further remarks on reducing truncation errors , 1965, CACM.

[2]  Andrew R. Webb,et al.  Statistical Pattern Recognition , 1999 .

[3]  Joost van de Weijer,et al.  Fast Anisotropic Gauss Filtering , 2002, ECCV.

[4]  J. Wade Davis,et al.  Statistical Pattern Recognition , 2003, Technometrics.

[5]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[6]  Cordelia Schmid,et al.  A Comparison of Affine Region Detectors , 2005, International Journal of Computer Vision.

[7]  Marcel Worring,et al.  On the surplus value of semantic video analysis beyond the key frame , 2005, 2005 IEEE International Conference on Multimedia and Expo.

[8]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[9]  Paul Over,et al.  Evaluation campaigns and TRECVid , 2006, MIR '06.

[10]  Marcel Worring,et al.  The challenge problem for automated detection of 101 semantic concepts in multimedia , 2006, MM '06.

[11]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[12]  Chin-Chen Chang,et al.  Fast codebook search algorithms based on tree-structured vector quantization , 2006, Pattern Recognition Letters.

[13]  Frédéric Jurie,et al.  Fast Discriminative Visual Codebooks using Randomized Clustering Forests , 2006, NIPS.

[14]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[15]  Marcel Worring,et al.  High-Performance Distributed Video Content Analysis with Parallel-Horus , 2007, IEEE MultiMedia.

[16]  Dong Wang,et al.  Video diver: generic video indexing with diverse features , 2007, MIR '07.

[17]  Jiawei Han,et al.  Efficient Kernel Discriminant Analysis via Spectral Regression , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[18]  Marcel Worring,et al.  High-Performance Distributed Image and Video Content Analysis with Parallel-Horus , 2007 .

[19]  Trevor Darrell,et al.  The Pyramid Match Kernel: Efficient Learning with Sets of Features , 2007, J. Mach. Learn. Res..

[20]  Jan-Michael Frahm,et al.  Feature tracking and matching in video using programmable graphics hardware , 2007, Machine Vision and Applications.

[21]  Cordelia Schmid,et al.  Learning Object Representations for Visual Object Class Recognition , 2007, ICCV 2007.

[22]  James Ze Wang,et al.  Image retrieval: Ideas, influences, and trends of the new age , 2008, CSUR.

[23]  John D. Owens,et al.  GPU Computing , 2008, Proceedings of the IEEE.

[24]  Chong-Wah Ngo,et al.  Columbia University/VIREO-CityU/IRIT TRECVID2008 High-Level Feature Extraction and Interactive Video Search , 2008, TRECVID.

[25]  Kurt Keutzer,et al.  Fast support vector machine training and classification on graphics processors , 2008, ICML '08.

[26]  François Poulet,et al.  Speed Up SVM Algorithm for Massive Classification Tasks , 2008, ADMA.

[27]  Wen-mei W. Hwu,et al.  MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs , 2008, LCPC.

[28]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[29]  Kevin Skadron,et al.  Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[30]  Ming Ouyang,et al.  COMPUTE PAIRWISE EUCLIDEAN DISTANCES OF DATA POINTS WITH GPUS , 2008 .

[31]  Yao Zhang,et al.  Parallel Computing Experiences with CUDA , 2008, IEEE Micro.

[32]  Toby Sharp,et al.  Implementing Decision Trees and Forests on a GPU , 2008, ECCV.

[33]  Saleh Omran,et al.  A Review of SIMD Multimedia Extensions and their Usage in Scientific and Engineering Applications , 2008, Comput. J..

[34]  Luc Van Gool,et al.  Fast scale invariant feature detection and matching on programmable graphics hardware , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[35]  Luc Van Gool,et al.  Speeded-Up Robust Features (SURF) , 2008, Comput. Vis. Image Underst..

[36]  Cordelia Schmid,et al.  The Pascal Visual Object Classes Challenge 2008 submission , 2008 .

[37]  Andrew Kerr,et al.  Translating GPU Binaries to Tiered SIMD Architectures with Ocelot , 2009 .

[38]  Marcel Worring,et al.  The MediaMill TRECVID 2009 Semantic Video Search Engine , 2009, TRECVID.

[39]  Cordelia Schmid,et al.  Packing bag-of-features , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[40]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[41]  Arnold W. M. Smeulders,et al.  Real-time bag of words, approximately , 2009, CIVR '09.

[42]  Gregory Diamos The Design and Implementation Ocelot’s Dynamic Binary Translator from PTX to Multi-Core x86 , 2009 .

[43]  Murat Efe Guney,et al.  On the limits of GPU acceleration , 2010 .

[44]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Chong-Wah Ngo,et al.  Representations of Keypoint-Based Semantic Concept Detection: A Comprehensive Study , 2010, IEEE Transactions on Multimedia.

[46]  Cor J. Veenman,et al.  Visual Word Ambiguity , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47]  Pradeep Dubey,et al.  Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.

[48]  Laura Hollink,et al.  Search behavior of media professionals at an audiovisual archive: A transaction log analysis , 2010 .

[49]  Uday Bondhugula,et al.  Believe it or Not! Multicore CPUs can Match GPUs for FLOP-intensive Applications , 2010 .

[50]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[51]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .