Comprehensive Semi-Supervised Multi-Modal Learning

Multi-modal learning aims to learn an accurate model over the joint representation of different modalities. Co-regularization is a promising approach, but it rests on the consistency principle together with a sufficiency assumption (that each modality alone is sufficient for prediction), which usually does not hold for real-world multi-modal data. Indeed, because modalities are often insufficient in real-world applications, heterogeneous modalities diverge from one another, which poses a critical challenge for multi-modal learning. To this end, we propose a novel Comprehensive Multi-Modal Learning (CMML) framework, which strikes a balance between consistency and divergence across modalities by accounting for insufficiency in one unified framework. Specifically, we use an instance-level attention mechanism to weight the sufficiency of each modality for each instance. Moreover, novel diversity regularization and robust consistency metrics are designed to discover insufficient modalities. Our empirical studies show the superior performance of CMML on real-world data under various criteria.
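To make the instance-level weighting idea concrete, the following is a minimal sketch and not the CMML implementation. It assumes each modality already yields per-label predictions and a scalar sufficiency score per instance; the names `attention_fuse`, `modal_scores`, and `huber` are hypothetical. The Huber-type loss stands in for a generic robust consistency measure; the abstract does not specify the exact form used.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(modal_preds, modal_scores):
    """Fuse per-modality predictions with instance-level attention weights.

    modal_preds:  (n_instances, n_modalities, n_labels) predictions.
    modal_scores: (n_instances, n_modalities) unnormalized sufficiency
                  scores (assumed to come from some learned scorer).
    """
    weights = softmax(modal_scores, axis=1)                  # one weight per modality, per instance
    fused = (weights[:, :, None] * modal_preds).sum(axis=1)  # weighted average over modalities
    return fused, weights

def huber(residual, delta=1.0):
    # Huber-type loss: quadratic for small residuals, linear for large ones,
    # so strongly disagreeing (possibly insufficient) modalities are penalized less harshly.
    a = np.abs(residual)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

# Illustrative data: one instance, two modalities, two labels.
preds = np.array([[[0.9, 0.1], [0.2, 0.8]]])
scores = np.array([[0.0, 0.0]])  # equal scores give equal weights
fused, weights = attention_fuse(preds, scores)
consistency = huber(preds[:, 0] - preds[:, 1]).mean()
```

With equal scores the fusion reduces to a plain average; a higher score for one modality shifts the fused prediction toward that modality for that instance only.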
