Paper Information - MultiDEC: Multi-Modal Clustering of Image-Caption Pairs

In this paper, we propose a method for clustering image-caption pairs by simultaneously learning image representations and text representations that are constrained to exhibit similar distributions. Such image-caption pairs arise frequently in high-value applications where structured training data is expensive to produce but free-text descriptions are common. MultiDEC initializes its parameters with stacked autoencoders, then iteratively minimizes the Kullback-Leibler divergence between the distribution of the images (and likewise of the text) and a combined joint target distribution. We regularize by penalizing non-uniform distributions across clusters. The representations that minimize this objective produce clusters that outperform both single-view and multi-view techniques on large benchmark image-caption datasets.
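The abstract does not state the objective in formulas, so the following is only a rough sketch of what such a loss could look like, not the authors' implementation. It assumes a Student's t soft assignment and a squared, frequency-normalized target in the style of "Unsupervised Deep Embedding for Clustering Analysis" (DEC). The way the two views are averaged into one joint target, the form of the uniformity penalty, and the names `soft_assign`, `joint_target`, `multidec_style_loss`, and `gamma` are all assumptions made for illustration. Stacked-autoencoder initialization is omitted; random embeddings stand in for it.

```python
import numpy as np


def soft_assign(z, centroids, alpha=1.0):
    """DEC-style soft assignment using a Student's t kernel (assumed here)."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)


def joint_target(q_img, q_txt):
    """Hypothetical joint target: average both views, then sharpen as in DEC."""
    q = 0.5 * (q_img + q_txt)
    w = q ** 2 / q.sum(axis=0, keepdims=True)
    return w / w.sum(axis=1, keepdims=True)


def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence summed over all entries."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def multidec_style_loss(q_img, q_txt, gamma=0.1):
    """KL from the joint target to each view, plus a uniformity penalty.

    gamma is a hypothetical regularization weight, not a value from the paper.
    """
    p = joint_target(q_img, q_txt)
    k = p.shape[1]
    cluster_freq = p.mean(axis=0)  # how much mass each cluster receives
    penalty = kl(cluster_freq, np.full(k, 1.0 / k))  # zero when usage is uniform
    return kl(p, q_img) + kl(p, q_txt) + gamma * penalty


# Random embeddings stand in for autoencoder outputs of each modality.
rng = np.random.default_rng(0)
z_img = rng.normal(size=(8, 4))
z_txt = rng.normal(size=(8, 4))
centroids_img = rng.normal(size=(3, 4))
centroids_txt = rng.normal(size=(3, 4))

q_img = soft_assign(z_img, centroids_img)
q_txt = soft_assign(z_txt, centroids_txt)
print(multidec_style_loss(q_img, q_txt))
```

In an actual training loop the encoders and centroids would be updated by gradient descent on this loss, with the target recomputed periodically; that part is left out to keep the sketch short.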

Bill Howe | Kuan-Hao Huang | Sean T. Yang
