Visually Aligned Word Embeddings for Improving Zero-shot Learning

Zero-shot learning (ZSL) depends heavily on a good semantic embedding to connect the seen and unseen classes. Recently, distributed word embeddings (DWEs) pre-trained on large text corpora have become a popular choice for drawing this connection. Compared with human-defined attributes, DWEs are more scalable and easier to obtain. However, they are designed to reflect semantic rather than visual similarity, and using them in ZSL therefore often leads to inferior performance. To overcome this visual-semantic discrepancy, this work proposes an objective function that re-aligns distributed word embeddings with visual information by learning a neural network to map them into a new representation, the visually aligned word embedding (VAWE), whose neighbourhood structure resembles that of the visual domain. Note that this work does not design a ZSL method that projects visual features and semantic embeddings onto a shared space; it only imposes a requirement on the neighbourhood structure of the mapped word embeddings. This strategy allows the learned VAWEs to generalize to various ZSL methods and visual features. Evaluated with four state-of-the-art ZSL methods on four benchmark datasets, VAWEs exhibit consistent performance improvements.
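The abstract describes learning a neural network that re-arranges pre-trained word embeddings so that their neighbourhood structure matches the visual domain. The following is a minimal sketch of one way such an alignment objective could look, not the paper's actual method: it assumes a two-layer MLP mapper, a triplet margin loss, and class neighbourhoods defined by class-mean visual features; all names, dimensions, and the triplet-selection heuristic (`VAWEMapper`, `visual_triplets`, margin 0.2) are illustrative assumptions.

```python
# Minimal PyTorch sketch of learning visually aligned word embeddings (VAWE).
# Assumptions (not from the abstract): a two-layer MLP mapper, a triplet
# margin loss, and class neighbourhoods defined by mean visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VAWEMapper(nn.Module):
    """Maps pre-trained word embeddings into a visually aligned space."""

    def __init__(self, word_dim=300, hidden_dim=512, out_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(word_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, w):
        # L2-normalise so neighbourhoods are governed by angular distance.
        return F.normalize(self.net(w), dim=-1)


def visual_triplets(visual_protos, k=5):
    """For each class (anchor), take its nearest visual neighbour as the
    positive and a class outside the top-k visual neighbours as the negative."""
    v = F.normalize(visual_protos, dim=-1)
    sim = v @ v.T
    sim.fill_diagonal_(-1.0)  # exclude self-similarity from the ranking
    ranked = sim.argsort(dim=1, descending=True)
    anchors = torch.arange(visual_protos.size(0))
    pos = ranked[:, 0]  # visually most similar class
    neg_rank = torch.randint(k, ranked.size(1), (1,)).item()
    neg = ranked[:, neg_rank]  # a visually dissimilar class
    return anchors, pos, neg


# Toy data: 50 seen classes, 300-d word vectors, 2048-d class-mean CNN features.
torch.manual_seed(0)
word_emb = torch.randn(50, 300)
visual_protos = torch.randn(50, 2048)

mapper = VAWEMapper()
optim = torch.optim.Adam(mapper.parameters(), lr=1e-3)
triplet = nn.TripletMarginLoss(margin=0.2)

for step in range(100):
    a, p, n = visual_triplets(visual_protos)
    vawe = mapper(word_emb)
    # Pull word embeddings of visually similar classes together and push
    # visually dissimilar ones apart in the mapped space.
    loss = triplet(vawe[a], vawe[p], vawe[n])
    optim.zero_grad()
    loss.backward()
    optim.step()
```

One design choice worth noting, consistent with the abstract: the mapping is learned only on the word-embedding side, with no shared visual-semantic projection, so the resulting VAWEs can be handed to any off-the-shelf ZSL method as a drop-in replacement for the original word embeddings.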
