Deep Multimodal Subspace Clustering Networks

We present convolutional neural network-based approaches for unsupervised multimodal subspace clustering. The proposed framework consists of three main stages: a multimodal encoder, a self-expressive layer, and a multimodal decoder. The encoder takes multimodal data as input and fuses them into a latent-space representation. The self-expressive layer enforces the self-expressiveness property and yields an affinity matrix over the data points. The decoder reconstructs the original input data, and the network is trained using the distance between the decoder's reconstruction and the original input. We investigate early, late, and intermediate fusion techniques and propose three different encoders, one for each, for spatial fusion. The self-expressive layers and multimodal decoders are essentially the same across the spatial fusion-based approaches. In addition to the spatial fusion-based methods, we propose an affinity fusion-based network in which the self-expressive layers corresponding to the different modalities are constrained to be the same. Extensive experiments on three datasets show that the proposed methods significantly outperform state-of-the-art multimodal subspace clustering methods.
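The self-expressiveness property mentioned above can be illustrated with a small NumPy sketch. It is not the authors' network. It assumes fixed latent features Z (one data point per row) rather than a trained encoder, a squared-error self-expression term with a ridge penalty, a zero diagonal on the coefficient matrix C, and the common affinity |C| + |C|^T. The function names, the regularization weight, and the toy data are all hypothetical.

```python
import numpy as np


def self_expressive_loss(Z, C, reg):
    """0.5 * ||Z - C Z||_F^2 + reg * ||C||_F^2, with data points as rows of Z."""
    residual = Z - C @ Z
    return 0.5 * np.sum(residual ** 2) + reg * np.sum(C ** 2)


def fit_self_expressive(Z, reg=0.01, steps=500):
    """Projected gradient descent on C subject to diag(C) = 0 (assumed setup)."""
    n = Z.shape[0]
    C = np.zeros((n, n))
    # Step size from the Lipschitz constant of the gradient:
    # largest singular value of Z squared, plus the ridge curvature.
    lr = 1.0 / (np.linalg.norm(Z, 2) ** 2 + 2.0 * reg)
    for _ in range(steps):
        residual = Z - C @ Z
        grad = -residual @ Z.T + 2.0 * reg * C
        C = C - lr * grad
        # A point may not be used to reconstruct itself.
        np.fill_diagonal(C, 0.0)
    return C


def affinity_from_coefficients(C):
    """Symmetric, non-negative affinity built from the coefficients."""
    A = np.abs(C)
    return A + A.T


# Toy latent features: two points on each of two orthogonal lines in R^2.
Z = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
C = fit_self_expressive(Z)
W = affinity_from_coefficients(C)
```

In this toy case W links only points that lie on the same line, which is the block structure a clustering step (typically spectral clustering on the affinity matrix) would then exploit. In the framework described above, C would instead be the weights of the self-expressive layer, learned jointly with the encoder and decoder.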
