Paper information - Evaluation of output embeddings for fine-grained image classification

Evaluation of output embeddings for fine-grained image classification

Image classification has advanced significantly in recent years with the availability of large-scale image sets. However, fine-grained classification remains a major challenge because annotating large numbers of fine-grained categories is costly. This project shows that compelling classification performance can be achieved on such categories even without labeled training data. Given image and class embeddings, we learn a compatibility function such that matching embeddings are assigned a higher score than mismatching ones; zero-shot classification of an image proceeds by finding the label yielding the highest joint compatibility score. We use state-of-the-art image features and focus on different supervised attributes and unsupervised output embeddings, either derived from hierarchies or learned from unlabeled text corpora. We establish a substantially improved state of the art on the Animals with Attributes and Caltech-UCSD Birds datasets. Most encouragingly, we demonstrate that purely unsupervised output embeddings (learned from Wikipedia and improved with fine-grained text) achieve compelling results, even outperforming the previous supervised state of the art. By combining different output embeddings, we further improve results.
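The compatibility-based zero-shot procedure described in the abstract can be illustrated with a minimal sketch. Several things here are assumptions, not details taken from the abstract: the compatibility function is taken to be bilinear, F(x, y) = x^T W y; training uses a simple margin-based ranking update; the toy data uses three made-up attributes; and the names `compatibility`, `predict`, and `train` are hypothetical. It is meant only to show how scores learned on seen classes can rank embeddings of unseen classes.

```python
import numpy as np


def compatibility(W, x, Y):
    """Bilinear scores x^T W y for one image embedding x against every column of Y."""
    return x @ W @ Y


def predict(W, x, Y):
    """Return the index of the class embedding with the highest compatibility score."""
    return int(np.argmax(compatibility(W, x, Y)))


def train(X, labels, Y, epochs=50, lr=0.1, margin=1.0, seed=0):
    """Fit W by SGD on a margin-based ranking loss (an assumed objective)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], Y.shape[0]))
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            x, y = X[i], labels[i]
            scores = compatibility(W, x, Y)
            # Margin violation of every wrong class relative to the true class.
            violation = margin + scores - scores[y]
            violation[y] = 0.0
            worst = int(np.argmax(violation))
            if violation[worst] > 0:
                # Raise the matching pair's score, lower the worst violator's.
                W += lr * np.outer(x, Y[:, y] - Y[:, worst])
    return W


# Toy data (assumed): 3 attributes; seen classes are one-hot,
# unseen classes each mix two attributes. Columns are class embeddings.
Y_seen = np.eye(3)
Y_unseen = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]) / np.sqrt(2)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
# Image embeddings are noisy copies of the class attribute signatures.
X = Y_seen[:, labels].T + 0.05 * rng.normal(size=(60, 3))

# Learn W using seen classes only.
W = train(X, labels, Y_seen)

# Zero-shot evaluation: test images come from classes never seen in training.
test_labels = np.repeat(np.arange(2), 20)
X_test = Y_unseen[:, test_labels].T + 0.05 * rng.normal(size=(40, 3))
preds = np.array([predict(W, x, Y_unseen) for x in X_test])
accuracy = float(np.mean(preds == test_labels))
print("zero-shot accuracy on unseen classes:", accuracy)
```

Because `W` maps between the image space and the class-embedding space rather than to a fixed label set, swapping `Y_seen` for `Y_unseen` at prediction time requires no retraining. This is what lets attribute, hierarchy, or text-derived embeddings act as the only supervision for unseen classes.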
