A codebook-free and annotation-free approach for fine-grained image categorization

Fine-grained categorization is the task of classifying objects that belong to the same basic-level class (e.g. different bird species) and share similar shape or visual appearance. Most state-of-the-art basic-level object classification algorithms have difficulty with this challenging problem. One reason is the popular codebook-based image representation, which often loses subtle image information that is critical for fine-grained classification. A common remedy is to introduce human annotations of object attributes or key points, but this is a tedious process that is also difficult to generalize to new tasks. In this work, we propose a codebook-free and annotation-free approach for fine-grained image categorization. Instead of using vector-quantized codewords, we obtain an image representation by running a high-throughput template matching process with a large number of randomly generated image templates. We then propose a novel bagging-based algorithm that builds the final classifier by aggregating a set of discriminative yet largely uncorrelated classifiers. Experimental results show that our method outperforms state-of-the-art classification approaches on the Caltech-UCSD Birds dataset.
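To make the two stages concrete, here is a minimal sketch of the general idea. It is not the paper's implementation. It assumes grayscale images stored as 2-D NumPy arrays, templates cropped at random from training images, normalized cross-correlation as the matching score, and plain bagging over nearest-centroid classifiers as a stand-in for the novel bagging-based algorithm, whose details the abstract does not give. All function names (`sample_templates`, `match_response`, `encode`, `fit_bagged_centroids`, `predict`) are hypothetical.

```python
import numpy as np

def sample_templates(images, n_templates, size, rng):
    """Crop square patches at random positions from randomly chosen images."""
    templates = []
    for _ in range(n_templates):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - size + 1)
        x = rng.integers(img.shape[1] - size + 1)
        templates.append(img[y:y + size, x:x + size].astype(float))
    return templates

def match_response(image, template):
    """Highest normalized cross-correlation of the template over all positions."""
    size = template.shape[0]
    windows = np.lib.stride_tricks.sliding_window_view(
        image.astype(float), (size, size))
    windows = windows - windows.mean(axis=(2, 3), keepdims=True)
    t = template - template.mean()
    num = (windows * t).sum(axis=(2, 3))
    den = np.sqrt((windows ** 2).sum(axis=(2, 3)) * (t ** 2).sum()) + 1e-8
    return float((num / den).max())

def encode(images, templates):
    """Codebook-free representation: one matching score per template."""
    return np.array([[match_response(img, t) for t in templates]
                     for img in images])

def fit_bagged_centroids(X, y, n_models, n_feats, rng):
    """Assumed stand-in: nearest-centroid models, each trained on a
    per-class bootstrap sample and a random subset of template responses."""
    classes = np.unique(y)
    models = []
    for _ in range(n_models):
        rows = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=int((y == c).sum()))
            for c in classes])
        cols = rng.choice(X.shape[1], size=n_feats, replace=False)
        Xb, yb = X[rows][:, cols], y[rows]
        centroids = np.array([Xb[yb == c].mean(axis=0) for c in classes])
        models.append((cols, centroids))
    return classes, models

def predict(X, classes, models):
    """Aggregate the individual models by majority vote."""
    votes = np.zeros((len(X), len(classes)), dtype=int)
    for cols, centroids in models:
        d = ((X[:, cols][:, None, :] - centroids[None]) ** 2).sum(axis=2)
        votes[np.arange(len(X)), d.argmin(axis=1)] += 1
    return classes[votes.argmax(axis=1)]
```

A typical call sequence under these assumptions would be `templates = sample_templates(train_images, 500, 16, np.random.default_rng(0))`, then `X = encode(train_images, templates)`, then `fit_bagged_centroids` and `predict`. Giving each model a different random subset of template responses is one simple way to keep the aggregated classifiers largely uncorrelated.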
