An effective hierarchical extreme learning machine based multimodal fusion framework

Abstract Deep learning has been successfully applied to multimodal representation learning. Like single-modal deep learning methods, multimodal deep learning methods consist of greedy layer-wise feedforward propagation followed by backpropagation (BP) fine-tuning driven by diverse targets. These models are time-consuming to train. In contrast, the extreme learning machine (ELM) is a fast learning algorithm for single-hidden-layer feedforward neural networks, and previous work has shown the effectiveness of an ELM-based hierarchical framework for multilayer perceptrons. In this paper, we introduce an ELM-based hierarchical framework for multimodal data. The proposed architecture consists of three main components: (1) self-taught feature extraction for each modality by an ELM-based sparse autoencoder, (2) fused representation learning based on the features learned in the previous step, and (3) supervised feature classification based on the fused representation. The framework is strictly feedforward: once a layer is established, its weights are fixed without fine-tuning. It therefore learns much more efficiently than gradient-based multimodal deep learning methods. In experiments on the MNIST, XRMB and NUS datasets, the proposed algorithm converges faster and achieves better classification performance than other existing multimodal deep learning models.
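The three components can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions made only for illustration: a ridge-regularized ELM autoencoder stands in for the ELM-based sparse autoencoder, the per-modality features are fused by simple concatenation followed by another ELM autoencoder, and all function names, layer sizes and the synthetic two-modality data are hypothetical.

```python
import numpy as np

def random_hidden(X, W, b):
    # Random hidden layer: W and b are drawn once and never trained.
    return np.tanh(X @ W + b)

def ridge_solve(H, T, reg):
    # Closed-form output weights: (H^T H + reg * I)^-1 H^T T.
    return np.linalg.solve(H.T @ H + reg * np.eye(H.shape[1]), H.T @ T)

def elm_autoencoder(X, n_hidden, reg, rng):
    # Assumption: ridge penalty instead of the sparse (L1) penalty in the paper.
    d = X.shape[1]
    W = rng.standard_normal((d, n_hidden)) / np.sqrt(d)
    b = rng.standard_normal(n_hidden)
    H = random_hidden(X, W, b)
    return ridge_solve(H, X, reg)  # beta reconstructs X from H, shape (n_hidden, d)

def encode(X, beta):
    # The learned output weights are reused (transposed) as the encoder.
    return np.tanh(X @ beta.T)

def fit(X1, X2, Y, n_hidden=16, n_fused=32, n_clf=64, reg=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    # (1) modality-specific feature extraction
    beta1 = elm_autoencoder(X1, n_hidden, reg, rng)
    beta2 = elm_autoencoder(X2, n_hidden, reg, rng)
    # (2) fused representation (assumption: concatenate, then one more autoencoder)
    joint = np.hstack([encode(X1, beta1), encode(X2, beta2)])
    beta_f = elm_autoencoder(joint, n_fused, reg, rng)
    Z = encode(joint, beta_f)
    # (3) supervised ELM classifier on the fused representation
    W = rng.standard_normal((n_fused, n_clf)) / np.sqrt(n_fused)
    b = rng.standard_normal(n_clf)
    beta_out = ridge_solve(random_hidden(Z, W, b), Y, reg)
    return {"beta1": beta1, "beta2": beta2, "beta_f": beta_f,
            "W": W, "b": b, "beta_out": beta_out}

def predict(model, X1, X2):
    joint = np.hstack([encode(X1, model["beta1"]), encode(X2, model["beta2"])])
    Z = encode(joint, model["beta_f"])
    scores = random_hidden(Z, model["W"], model["b"]) @ model["beta_out"]
    return scores.argmax(axis=1)

# Synthetic two-modality, two-class data (purely illustrative).
data_rng = np.random.default_rng(1)
n = 200
y = np.arange(n) % 2
X1 = data_rng.standard_normal((2, 20))[y] + data_rng.standard_normal((n, 20))
X2 = data_rng.standard_normal((2, 15))[y] + data_rng.standard_normal((n, 15))
Y = np.eye(2)[y]  # one-hot targets

model = fit(X1, X2, Y)
pred = predict(model, X1, X2)
print("training accuracy:", (pred == y).mean())
```

Every layer is obtained from a single linear solve and is left untouched afterwards, which is the property the abstract credits for the efficiency gain over BP fine-tuning.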
