Deep Learning for Multimodal Data Fusion

Abstract Recent advances in deep learning have enabled realistic image-to-image translation of multimodal data. Alongside these developments, auto-encoders and generative adversarial networks (GANs) have been extended to handle multimodal input and output. At the same time, multitask learning has been shown to address multiple mutually related recognition tasks efficiently and effectively. Various scene understanding tasks, such as semantic segmentation and depth prediction, can be viewed as cross-modal encoding/decoding, and hence most prior work has used multimodal (various types of input) datasets for multitask (various types of output) learning. Inter-modal commonalities, such as those across RGB images, depth, and semantic labels, are beginning to be exploited, although this line of research is still at an early stage. In this chapter, we introduce several state-of-the-art encoder–decoder methods for multimodal learning as well as a new approach to cross-modal networks. In particular, we detail a multimodal encoder–decoder network that harnesses the multimodal nature of multitask scene recognition. In addition to the latent representation shared among encoder–decoder pairs, the model also shares skip connections from different encoders. By combining these two representation-sharing mechanisms, the model is shown to efficiently learn a shared feature representation among all modalities in the training data.
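
To make the architectural idea concrete, the sketch below shows one possible way to realize a multimodal encoder–decoder with a shared latent space and skip connections shared across encoders, in PyTorch. It is a minimal illustration under assumed choices: the module names (ModalityEncoder, ModalityDecoder, MultimodalEncoderDecoder), layer sizes, and the example modalities are hypothetical and are not taken from the chapter's actual implementation.

```python
# Minimal sketch (assumed design, not the authors' code): one encoder and one
# decoder per modality, a latent representation shared by all pairs, and skip
# features that any decoder can consume regardless of which encoder produced them.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Encodes one modality (e.g. RGB or depth) into the shared latent code,
    also returning an intermediate feature map used as a skip connection."""
    def __init__(self, in_channels, latent_channels=64):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(
            nn.Conv2d(32, latent_channels, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        skip = self.conv1(x)       # higher-resolution feature for the skip path
        latent = self.conv2(skip)  # shared latent representation
        return latent, skip


class ModalityDecoder(nn.Module):
    """Decodes the shared latent code into one modality, fusing in a skip
    feature that may come from a different modality's encoder."""
    def __init__(self, out_channels, latent_channels=64):
        super().__init__()
        self.up1 = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 32, 4, stride=2, padding=1), nn.ReLU())
        # Input channels = upsampled features (32) + shared skip features (32).
        self.up2 = nn.ConvTranspose2d(32 + 32, out_channels, 4, stride=2, padding=1)

    def forward(self, latent, skip):
        h = self.up1(latent)
        h = torch.cat([h, skip], dim=1)  # skip connection shared across modalities
        return self.up2(h)


class MultimodalEncoderDecoder(nn.Module):
    """All encoder-decoder pairs share the latent space; decoders accept skip
    features produced by any encoder, enabling cross-modal translation."""
    def __init__(self, modality_channels):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {m: ModalityEncoder(c) for m, c in modality_channels.items()})
        self.decoders = nn.ModuleDict(
            {m: ModalityDecoder(c) for m, c in modality_channels.items()})

    def forward(self, x, src, dst):
        latent, skip = self.encoders[src](x)
        return self.decoders[dst](latent, skip)


if __name__ == "__main__":
    # Hypothetical setup: RGB (3 channels), depth (1 channel), semantic labels
    # treated as a 13-class map, mirroring typical indoor-scene benchmarks.
    model = MultimodalEncoderDecoder({"rgb": 3, "depth": 1, "semantic": 13})
    rgb = torch.randn(2, 3, 64, 64)
    depth_pred = model(rgb, src="rgb", dst="depth")  # cross-modal translation
    rgb_recon = model(rgb, src="rgb", dst="rgb")     # self-reconstruction
    print(depth_pred.shape, rgb_recon.shape)         # (2, 1, 64, 64), (2, 3, 64, 64)
```

In this reading, training would sample source/target modality pairs so that every decoder sees latent codes and skip features from every encoder, which is what encourages the shared representation described in the abstract; the exact training schedule and losses used in the chapter are not reproduced here.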