An angle-based method for measuring the semantic similarity between visual and textual features

The main challenge for most image–text tasks, such as zero-shot learning, is measuring the semantic similarity between visual and textual feature vectors. The common solution is to map the image and text feature vectors into a Hilbert space and then rank similarity by the inner product between them. In this paper, we learn feature representations of images and their sentence descriptions with separate deep neural networks in order to capture the inner-modal correspondences between visual and language data. We then use a joint embedding structure based on angle calculation to measure the semantic similarity between visual and textual features. In the proposed method, a constant factor b keeps the similarities of positive and negative samples a certain distance apart. Since the proposed cosine similarity method involves both normalization and vector computation, we also develop the learning algorithm for the neural networks that express the semantic features of texts and images. We apply the angle-based method to the challenging Caltech-UCSD Birds and Oxford-102 Flowers datasets. The experiments demonstrate good performance on both recognition and retrieval tasks.
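To make the angle-based idea concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the image and text embeddings have already been produced by the two networks. The abstract does not give the loss, so the sketch assumes a hinge-style ranking loss in which the constant b acts as the margin between positive and negative similarities. The function names, the default value of b, and the convention that row i of each matrix forms a matched pair are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine_similarity_matrix(image_emb, text_emb, eps=1e-12):
    """Cosine of the angle between every image/text embedding pair.

    image_emb: (n, d) array, text_emb: (m, d) array; returns an (n, m) array.
    """
    # Normalize each row to unit length so the inner product equals cos(angle).
    img = image_emb / np.maximum(
        np.linalg.norm(image_emb, axis=1, keepdims=True), eps)
    txt = text_emb / np.maximum(
        np.linalg.norm(text_emb, axis=1, keepdims=True), eps)
    return img @ txt.T


def margin_ranking_loss(sim, b=0.2):
    """Assumed hinge loss: each matched pair (the diagonal of sim) should
    score at least b higher than every mismatched pair in the same row."""
    n = sim.shape[0]
    pos = np.diag(sim)[:, None]              # similarity of image i with text i
    violations = np.maximum(0.0, b + sim - pos)
    violations[np.arange(n), np.arange(n)] = 0.0  # ignore the positives
    return violations.sum() / max(n * (n - 1), 1)


# Toy example: two images and two texts whose embeddings are orthogonal
# across pairs, so matched pairs have cosine 1 and mismatched pairs 0.
image_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
text_emb = np.array([[2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(image_emb, text_emb)
loss_small_margin = margin_ranking_loss(sim, b=0.2)  # margin already satisfied
loss_large_margin = margin_ranking_loss(sim, b=1.5)  # margin cannot be met
```

Because both embeddings are normalized before the product, the score depends only on the angle between the vectors and not on their lengths, which is what distinguishes this from ranking by the raw inner product.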
