论文信息 - Improved Multimodal Deep Learning with Variation of Information

Improved Multimodal Deep Learning with Variation of Information

Deep learning has been successfully applied to multimodal representation learning problems, with a common strategy to learning joint representations that are shared across multiple modalities on top of layers of modality-specific networks. Nonetheless, there still remains a question how to learn a good association between data modalities; in particular, a good generative model of multimodal data should be able to reason about missing data modality given the rest of data modalities. In this paper, we propose a novel multimodal representation learning framework that explicitly aims this goal. Rather than learning with maximum likelihood, we train the model to minimize the variation of information. We provide a theoretical insight why the proposed learning objective is sufficient to estimate the data-generating joint distribution of multimodal data. We apply our method to restricted Boltzmann machines and introduce learning methods based on contrastive divergence and multi-prediction training. In addition, we extend to deep networks with recurrent encoding structure to finetune the whole network. In experiments, we demonstrate the state-of-the-art visual recognition performance on MIR-Flickr database and PASCAL VOC 2007 database with and without text features.

[1] Geoffrey E. Hinton,et al. Replicated Softmax: an Undirected Topic Model , 2009, NIPS.

[2] Cordelia Schmid,et al. Image annotation with tagprop on the MIRFLICKR set , 2010, MIR '10.

[3] David Maxwell Chickering,et al. Dependency Networks for Inference, Collaborative Filtering, and Data Visualization , 2000, J. Mach. Learn. Res..

[4] Nitish Srivastava,et al. Discriminative Transfer Learning with Tree-based Priors , 2013, NIPS.

[5] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[6] Stefan B. Williams,et al. Multimodal learning for autonomous underwater vehicles from visual and bathymetric data , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[7] Yoshua Bengio,et al. Multi-Prediction Deep Boltzmann Machines , 2013, NIPS.

[8] Honglak Lee,et al. Deep learning for detecting robotic grasps , 2013, Int. J. Robotics Res..

[9] Honglak Lee,et al. Deep learning for robust feature generation in audiovisual emotion recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10] Juhan Nam,et al. Multimodal Deep Learning , 2011, ICML.

[11] Pascal Vincent,et al. Generalized Denoising Auto-Encoders as Generative Models , 2013, NIPS.

[12] Antonio Torralba,et al. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[13] D. Heckerman,et al. Dependency networks for inference , 2000 .

[14] Tijmen Tieleman,et al. Training restricted Boltzmann machines using approximations to the likelihood gradient , 2008, ICML '08.

[15] Gang Wang,et al. Multi-modal Unsupervised Feature Learning for RGB-D Scene Labeling , 2014, ECCV.

[16] B. S. Manjunath,et al. Color and texture descriptors , 2001, IEEE Trans. Circuits Syst. Video Technol..

[17] Bart Thomee,et al. New trends and ideas in visual concept detection: the MIR flickr retrieval evaluation initiative , 2010, MIR '10.

[18] Jing Huang,et al. Audio-visual deep learning for noise robust speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[19] Geoffrey E. Hinton,et al. Deep Boltzmann Machines , 2009, AISTATS.

[20] Geoffrey E. Hinton,et al. Conditional Restricted Boltzmann Machines for Structured Output Prediction , 2011, UAI.

[21] Dieter Fox,et al. RGB-D Object Recognition: Features, Algorithms, and a Large Scale Benchmark , 2013, Consumer Depth Cameras for Computer Vision.

[22] Harry Joe,et al. Composite Likelihood Methods , 2012 .

[23] Andrew Zisserman,et al. Image Classification using Random Forests and Ferns , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[24] Charles R. Johnson,et al. Matrix analysis , 1985, Statistical Inference for Engineers and Data Scientists.

[25] Simon J. Doran,et al. Stacked Autoencoders for Unsupervised Feature Learning and Multiple Organ Detection in a Pilot Study Using 4D Patient Data , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26] Geoffrey E. Hinton,et al. Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[27] Cordelia Schmid,et al. Multimodal semi-supervised learning for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[28] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[29] Ruslan Salakhutdinov,et al. Learning Stochastic Feedforward Neural Networks , 2013, NIPS.

[30] Trevor Darrell,et al. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[31] Yoshua Bengio,et al. Deep Generative Stochastic Networks Trainable by Backprop , 2013, ICML.

[32] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[33] Mark J. Huiskes,et al. The MIR flickr retrieval evaluation , 2008, MIR '08.

[34] Nitish Srivastava,et al. Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..