Deep Mutual Information Maximin for Cross-Modal Clustering

Cross-modal clustering (CMC) aims to improve clustering performance by exploiting complementary information from multiple modalities. However, the performance of existing CMC algorithms remains unsatisfactory because of conflicts among heterogeneous modalities and the high-dimensional, non-linear nature of each individual modality. In this paper, a novel deep mutual information maximin (DMIM) method for cross-modal clustering is proposed to maximally preserve the information shared by multiple modalities while eliminating the superfluous information of individual modalities in an end-to-end manner. Specifically, a multi-modal shared encoder is first built to align the latent feature distributions by sharing parameters across modalities. Then, DMIM formulates the complementarity of multi-modal representations as a mutual information maximin objective function, in which the shared information of multiple modalities is identified by mutual information maximization and the superfluous information of individual modalities by mutual information minimization. To solve the DMIM objective function, we propose a variational optimization method that ensures convergence to a locally optimal solution. Moreover, an auxiliary overclustering mechanism is employed to refine the clustering structure by introducing a larger number of finer-grained clusters. Extensive experimental results demonstrate the superiority of DMIM over state-of-the-art cross-modal clustering methods on the IAPR-TC12, ESP-Game, MIRFlickr and NUS-WIDE datasets.
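
As a rough illustration of the maximization half of such an objective, the sketch below estimates the mutual information between the soft cluster assignments that two modalities produce for the same samples. It is a minimal NumPy sketch and not the DMIM implementation: the function name `mutual_information`, the parameter `eps`, and the use of an empirical joint distribution over cluster indices are assumptions made for illustration, and the minimization term for superfluous information and the variational optimization are not modeled.

```python
import numpy as np


def mutual_information(p_a, p_b, eps=1e-12):
    """Estimate I(A; B) in nats between two sets of soft cluster assignments.

    p_a, p_b: arrays of shape (n_samples, n_clusters); each row is a
    probability vector for the same sample seen through modality A or B.
    Hypothetical helper for illustration only.
    """
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    n_samples = p_a.shape[0]

    # Empirical joint distribution over (cluster in A, cluster in B).
    joint = p_a.T @ p_b / n_samples

    # Marginals obtained by summing the joint over the other variable.
    marg_a = joint.sum(axis=1, keepdims=True)
    marg_b = joint.sum(axis=0, keepdims=True)

    # I(A; B) = sum_ij P_ij * (log P_ij - log P_i - log P_j).
    # eps guards the logarithm; zero cells contribute zero to the sum.
    log_ratio = np.log(joint + eps) - np.log(marg_a + eps) - np.log(marg_b + eps)
    return float(np.sum(joint * log_ratio))


if __name__ == "__main__":
    # Two modalities that agree perfectly on a balanced 2-cluster split.
    agree = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
    # A modality whose assignments carry no information about the split.
    uninformative = np.full((4, 2), 0.5)

    print(mutual_information(agree, agree))
    print(mutual_information(agree, uninformative))
```

When the two modalities agree on a balanced two-cluster split, the estimate is log 2 (about 0.693 nats); when one modality assigns every sample uniformly, it is approximately zero. A training loop would maximize a quantity of this kind for the shared information, under the assumptions stated above.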
