A Survey of Multi-View Representation Learning

Multi-view representation learning has recently become a rapidly growing direction in machine learning and data mining. This paper organizes the field into two categories: multi-view representation alignment and multi-view representation fusion. We first review the representative methods and theories of multi-view representation learning from the perspective of alignment, such as correlation-based alignment; representative examples are canonical correlation analysis (CCA) and several of its extensions. Then, from the perspective of representation fusion, we examine advances that range from generative methods, including multi-modal topic learning, multi-view sparse coding, and multi-view latent space Markov networks, to neural network-based methods, including multi-modal autoencoders, multi-view convolutional neural networks, and multi-modal recurrent neural networks. We also survey several important applications of multi-view representation learning. Overall, this survey aims to provide an insightful overview of the theoretical foundations and state-of-the-art developments in the field and to help researchers find the most appropriate tools for particular applications.
