MultiDEC: Multi-Modal Clustering of Image-Caption Pairs

In this paper, we propose a method for clustering image-caption pairs by simultaneously learning image representations and text representations that are constrained to exhibit similar distributions. Image-caption pairs arise frequently in high-value applications where structured training data is expensive to produce but free-text descriptions are common. MultiDEC initializes parameters with stacked autoencoders, then iteratively minimizes the Kullback-Leibler divergence between the distribution of each modality (images and text) and a combined joint target distribution. We regularize by penalizing non-uniform distributions across clusters. The representations that minimize this objective produce clusters that outperform both single-view and multi-view techniques on large benchmark image-caption datasets.
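The objective described above can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so everything below is an assumption: the Student's t soft assignment and the squared-and-renormalized target follow the general DEC recipe, the joint target is taken as the average of the two per-view targets, and the regularizer is taken as the KL divergence from each view's mean cluster frequency to a uniform prior. The function names and the `gamma` weight are hypothetical and not taken from the paper.

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    """Soft cluster assignments from embeddings via a Student's t kernel (assumed, DEC-style)."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def sharpen(q):
    """Square the assignments, divide by soft cluster size, renormalize each row."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def joint_target(q_img, q_txt):
    """Assumed joint target: average of the sharpened image and text assignments."""
    return 0.5 * (sharpen(q_img) + sharpen(q_txt))

def kl(p, q, eps=1e-12):
    """KL(p || q) summed over all entries; eps guards against log(0)."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def multiview_loss(q_img, q_txt, gamma=1.0):
    """KL from the joint target to each view, plus an assumed non-uniformity penalty."""
    p = joint_target(q_img, q_txt)
    k = p.shape[1]
    uniform = np.full(k, 1.0 / k)
    loss = kl(p, q_img) + kl(p, q_txt)
    for q in (q_img, q_txt):
        # Penalize mean cluster frequencies that drift away from uniform.
        loss += gamma * kl(q.mean(axis=0), uniform)
    return loss

# Toy data standing in for autoencoder embeddings of 8 image-caption pairs.
rng = np.random.default_rng(0)
z_img, z_txt = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
mu_img, mu_txt = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
q_img, q_txt = soft_assign(z_img, mu_img), soft_assign(z_txt, mu_txt)
print("loss:", multiview_loss(q_img, q_txt))
```

In an actual training loop the loss would be differentiated with respect to the encoder weights and centroids, with the target held fixed between periodic updates; this sketch only evaluates the objective once.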
