Learning Disentangled Latent Factors from Paired Data in Cross-Modal Retrieval: An Implicit Identifiable VAE Approach

We address the problem of learning the underlying disentangled latent factors shared between paired bi-modal data in cross-modal retrieval. We assume the data in both modalities are complex, structured, and high dimensional (e.g., image and text), a regime in which conventional deep auto-encoding latent variable models such as the Variational Autoencoder (VAE) often struggle to train an accurate decoder or synthesize realistic data. A suboptimally trained decoder can in turn harm the model's ability to identify the true factors. In this paper we propose the novel idea of an implicit decoder, which removes the ambient data decoding module from the latent variable model entirely: the encoder is inverted implicitly via Jacobian regularization of the low-dimensional embedding function. Motivated by the recent Identifiable VAE (IVAE) model, we modify it to incorporate the query-modality data as a conditioning auxiliary input, which allows us to prove that the true parameters of the model can be identified under certain regularity conditions. Tested on various datasets where the true factors are fully or partially available, our model identifies the factors accurately, significantly outperforming conventional encoder-decoder latent variable models. We also evaluate our model on Recipe1M, the large-scale food image/recipe dataset, where the learned factors closely align with the most pronounced and widely agreed-upon food factors, including savoriness, wateriness, and greenness.
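The Jacobian-regularization idea behind the implicit decoder can be illustrated with a minimal sketch. This is a hypothetical toy example, not the paper's actual objective: we penalize the deviation of J J^T from the identity, where J is the Jacobian of a small two-layer encoder, which pushes the embedding toward a local isometry onto its image and hence local invertibility, without ever training an explicit decoder. All names, shapes, and architecture choices below are illustrative assumptions.

```python
import numpy as np

def encoder(params, x):
    """Toy two-layer MLP embedding f: R^d -> R^k (k << d)."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def encoder_jacobian(params, x):
    """Analytic Jacobian J = df/dx of the toy encoder, shape (k, d)."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    # J[j, i] = sum_m W2[m, j] * (1 - h[m]^2) * W1[i, m]
    return params["W2"].T @ (params["W1"] * (1.0 - h ** 2)).T

def jacobian_penalty(params, x):
    """Penalize || J J^T - I ||_F^2: near zero, the encoder is locally
    an isometry onto its image, so it is invertible there and no
    explicit decoder needs to be trained."""
    J = encoder_jacobian(params, x)
    JJt = J @ J.T
    return float(np.sum((JJt - np.eye(JJt.shape[0])) ** 2))

# Usage on random data (all dimensions are arbitrary for illustration).
rng = np.random.default_rng(0)
d, hdim, k = 8, 16, 2
params = {
    "W1": 0.1 * rng.standard_normal((d, hdim)),
    "b1": np.zeros(hdim),
    "W2": 0.1 * rng.standard_normal((hdim, k)),
    "b2": np.zeros(k),
}
x = rng.standard_normal(d)
loss = jacobian_penalty(params, x)
```

In practice such a penalty would be added to the model's training loss and minimized jointly with the encoder parameters; the sketch only shows the regularizer itself.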
