Cross-modal Center Loss

Cross-modal retrieval aims to learn discriminative and modal-invariant features for data from different modalities. Unlike the existing methods which usually learn from the features extracted by offline networks, in this paper, we propose an approach to jointly train the components of cross-modal retrieval framework with metadata, and enable the network to find optimal features. The proposed end-to-end framework is updated with three loss functions: 1) a novel cross-modal center loss to eliminate cross-modal discrepancy, 2) cross-entropy loss to maximize inter-class variations, and 3) mean-square-error loss to reduce modality variations. In particular, our proposed cross-modal center loss minimizes the distances of features from objects belonging to the same class across all modalities. Extensive experiments have been conducted on the retrieval tasks across multi-modalities, including 2D image, 3D point cloud, and mesh data. The proposed framework significantly outperforms the state-of-the-art methods on the ModelNet40 dataset.

[1]  Thomas Brox,et al.  Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[2]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[3]  Yue Gao,et al.  MeshNet: Mesh Neural Network for 3D Shape Representation , 2018, AAAI.

[4]  Jeff A. Bilmes,et al.  On Deep Multi-View Representation Learning , 2015, ICML.

[5]  Sebastian Scherer,et al.  VoxNet: A 3D Convolutional Neural Network for real-time object recognition , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[6]  Yu Qiao,et al.  A Discriminative Feature Learning Approach for Deep Face Recognition , 2016, ECCV.

[7]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[8]  Ruifan Li,et al.  Cross-modal Retrieval with Correspondence Autoencoder , 2014, ACM Multimedia.

[9]  Bernard Ghanem,et al.  DeepGCNs: Can GCNs Go As Deep As CNNs? , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  Shiguang Shan,et al.  Multi-View Discriminant Analysis , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Subhransu Maji,et al.  A Deeper Look at 3D Shape Classifiers , 2018, ECCV Workshops.

[12]  Daniel Cohen-Or,et al.  MeshCNN: a network with an edge , 2019, ACM Trans. Graph..

[13]  Chen Fang,et al.  Visual to Sound: Generating Natural Sound for Videos in the Wild , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Dezhong Peng,et al.  Deep Supervised Cross-Modal Retrieval , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Yue Wang,et al.  Dynamic Graph CNN for Learning on Point Clouds , 2018, ACM Trans. Graph..

[18]  Yuxin Peng,et al.  CCL: Cross-modal Correlation Learning With Multigrained Fusion by Hierarchical Network , 2017, IEEE Transactions on Multimedia.

[19]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[20]  Yueting Zhuang,et al.  Supervised Coupled Dictionary Learning with Group Structures for Multi-modal Retrieval , 2013, AAAI.

[21]  Leonidas J. Guibas,et al.  KPConv: Flexible and Deformable Convolution for Point Clouds , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Xiaohua Zhai,et al.  Learning Cross-Media Joint Representation With Sparse and Semisupervised Regularization , 2014, IEEE Transactions on Circuits and Systems for Video Technology.

[23]  Jianxiong Xiao,et al.  3D ShapeNets: A deep representation for volumetric shapes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Chen Wang,et al.  Supervised Contrastive Learning , 2020, NeurIPS.

[25]  Victor S. Lempitsky,et al.  Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud Models , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Andrew Zisserman,et al.  Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Fuxin Li,et al.  PointConv: Deep Convolutional Networks on 3D Point Clouds , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Ajmal Mian,et al.  Spherical Kernel for Efficient Graph Convolution on 3D Point Clouds , 2020, IEEE transactions on pattern analysis and machine intelligence.

[29]  Jeffrey S. Racine,et al.  CROSS-VALIDATED LOCAL LINEAR NONPARAMETRIC REGRESSION , 2004 .

[30]  Leonidas J. Guibas,et al.  PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[32]  Leonidas J. Guibas,et al.  PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space , 2017, NIPS.

[33]  Yingli Tian,et al.  Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Leonidas J. Guibas,et al.  Volumetric and Multi-view CNNs for Object Classification on 3D Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Yang Yang,et al.  Adversarial Cross-Modal Retrieval , 2017, ACM Multimedia.

[36]  Jeff A. Bilmes,et al.  Deep Canonical Correlation Analysis , 2013, ICML.