Representation learning using step-based deep multi-modal autoencoders

Abstract Deep learning techniques have been used successfully to learn a common representation for multi-view data, in which different modalities are projected onto a common subspace. Broadly, the techniques used for common representation learning fall into two categories: ‘canonical correlation-based’ approaches and ‘autoencoder-based’ approaches. In this paper, we investigate the performance of deep autoencoder-based methods on multi-view data. We propose a novel step-based correlation multi-modal deep convolutional neural network (CorrMCNN), which reconstructs one view of the data given the other while increasing the interaction between the representations at each hidden layer, that is, at every intermediate step. Step reconstruction relaxes the constraint of reconstructing the original data: the objective function is instead optimized for the reconstruction of representative features. This helps the proposed model generalize efficiently to representation and transfer learning tasks on high-dimensional data. Finally, we evaluate the proposed model on three multi-view and cross-modal problems: audio articulation, cross-modal image retrieval, and multilingual (cross-language) document classification. Through extensive experiments, we find that the proposed model outperforms current state-of-the-art deep learning techniques on all three tasks.
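The idea of pairing cross-view reconstruction with a per-layer interaction term can be illustrated with a minimal sketch. The abstract does not state the actual objective, so everything here is an assumption: the function names `batch_correlation` and `step_based_loss`, the weight `lam`, the use of Pearson correlation summed over features, and the requirement that paired hidden layers of the two views have the same width are illustrative choices, not the CorrMCNN formulation.

```python
import numpy as np


def batch_correlation(a, b, eps=1e-8):
    """Sum, over feature dimensions, of the Pearson correlation between
    two batches of activations with shape (batch, features)."""
    a_c = a - a.mean(axis=0)
    b_c = b - b.mean(axis=0)
    num = (a_c * b_c).sum(axis=0)
    den = np.sqrt((a_c ** 2).sum(axis=0) * (b_c ** 2).sum(axis=0)) + eps
    return float((num / den).sum())


def step_based_loss(x_hidden, y_hidden, x_target, x_from_y,
                    y_target, y_from_x, lam=1.0):
    """Hypothetical objective: cross-view reconstruction error minus a
    correlation reward accumulated at every hidden layer (each "step").

    x_hidden, y_hidden: lists of per-layer activations for the two views
                        (assumed to have matching shapes layer by layer).
    x_from_y, y_from_x: reconstructions of one view from the other.
    """
    recon = (np.mean((x_target - x_from_y) ** 2)
             + np.mean((y_target - y_from_x) ** 2))
    corr = sum(batch_correlation(hx, hy)
               for hx, hy in zip(x_hidden, y_hidden))
    # Minimizing this value lowers reconstruction error and raises the
    # correlation between the two views' intermediate representations.
    return float(recon - lam * corr)
```

In a real model the hidden activations and reconstructions would come from the encoder and decoder networks, and the loss would be written in an automatic-differentiation framework so that it can be minimized by gradient descent.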
