Cross-Modal Discriminant Adversarial Network

Abstract Cross-modal retrieval aims to retrieve relevant items across different modalities, such as retrieving images via texts. A key challenge in cross-modal retrieval is narrowing the heterogeneity gap between modalities. To overcome this challenge, we propose a novel method termed Cross-Modal Discriminant Adversarial Network (CAN). Taking bi-modal data as a showcase, CAN consists of two parallel modality-specific generators, two modality-specific discriminators, and a Cross-modal Discriminant Mechanism (CDM). Specifically, the generators project the modalities into a latent cross-modal discriminant space. Meanwhile, the discriminators compete against the generators to alleviate the heterogeneous discrepancy in this space: the generators try to generate unified features to confuse the discriminators, while the discriminators aim to identify the modality of the generated features. To further remove redundancy and preserve discrimination, CDM projects the generated features into a single common space, accompanied by a novel eigenvalue-based loss. Thanks to this loss, CDM pushes as much discriminative power as possible into all latent directions. To demonstrate the effectiveness of CAN, comprehensive experiments are conducted on four multimedia datasets, comparing against 15 state-of-the-art approaches.
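The architecture described above can be sketched in miniature. This is a structural illustration only: the paper's generators are deep networks and the exact form of the eigenvalue-based loss is not given in the abstract, so the single-layer generators, the logistic modality discriminator, and the LDA-style trace-ratio objective below are all assumptions, with toy dimensions standing in for real image/text features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; real inputs would be e.g. CNN image features and word-vector text features.
d_img, d_txt, d_lat = 8, 6, 4
W_i = rng.normal(size=(d_img, d_lat))
W_t = rng.normal(size=(d_txt, d_lat))

def gen(x, W):
    # Modality-specific generator: project one modality into the shared latent space.
    return np.maximum(x @ W, 0.0)

imgs = rng.normal(size=(10, d_img))
txts = rng.normal(size=(10, d_txt))
z_i, z_t = gen(imgs, W_i), gen(txts, W_t)

# Modality discriminator: a logistic head guessing which modality a latent point
# came from. The generators would be trained adversarially to make it fail, so
# the two latent distributions align.
w_d = rng.normal(size=d_lat)
def bce(z, y):
    p = 1.0 / (1.0 + np.exp(-(z @ w_d)))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
adv_loss = bce(z_i, 1.0) + bce(z_t, 0.0)

# Eigenvalue-based discriminant loss (assumed LDA-style): eigenvalues of
# Sw^{-1} Sb measure class separation along each latent direction; summing
# *all* of them, rather than only the largest, spreads discriminative power
# across every direction, as the abstract describes.
labels = np.repeat(np.arange(5), 4)          # toy class labels for all 20 points
Z = np.vstack([z_i, z_t])
mu = Z.mean(0)
Sw = sum((Z[labels == c] - Z[labels == c].mean(0)).T
         @ (Z[labels == c] - Z[labels == c].mean(0)) for c in range(5))
Sb = sum((labels == c).sum()
         * np.outer(Z[labels == c].mean(0) - mu, Z[labels == c].mean(0) - mu)
         for c in range(5))
eigvals = np.real(np.linalg.eigvals(
    np.linalg.solve(Sw + 1e-3 * np.eye(d_lat), Sb)))  # ridge keeps Sw invertible
cdm_loss = -eigvals.sum()                     # minimize = maximize all eigenvalues
```

In a full implementation the generators would minimize `cdm_loss` plus a term rewarding confusion of the discriminator, while the discriminator minimizes `adv_loss`, alternating the two updates as in standard GAN training.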
