Multimodal Representation Learning: Advances, Trends and Challenges

Representation learning is the foundation of downstream tasks such as classification, regression, and recognition. Its goal is to automatically learn good features with deep models. Multimodal representation learning is a special case of representation learning that automatically learns good features from multiple modalities. These modalities are not independent: there are correlations and associations among them. Furthermore, multimodal data are usually heterogeneous. These characteristics make multimodal representation learning difficult in several ways, including how to combine multimodal data from heterogeneous sources, how to jointly learn features from multimodal data, and how to effectively describe the correlations and associations among modalities. Along with the upsurge of deep learning, these difficulties have attracted great interest from researchers, and many deep multimodal learning methods have been proposed. In this paper, we present an overview of deep multimodal learning, especially the approaches proposed over the last decades. We provide readers with advances, trends, and challenges, which can be helpful to researchers in the field of machine learning, especially those engaged in the study of multimodal deep learning.
