FedCMR: Federated Cross-Modal Retrieval

Deep cross-modal retrieval methods have shown their competitiveness among different cross-modal retrieval algorithms. Generally, these methods require a large amount of training data. However, aggregating large amounts of data incurs serious privacy risks and high maintenance costs. Inspired by the recent success of federated learning, we propose federated cross-modal retrieval (FedCMR), which learns the model from decentralized multi-modal data. Specifically, we first train the cross-modal retrieval model and learn the common space across multiple modalities in each client using its local data. Then, we jointly learn the common subspace of multiple clients on a trusted central server. Finally, each client updates the common subspace of its local model based on the aggregated common subspace on the server, so that all clients participating in the training benefit from federated learning. Experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method.
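To make the training loop concrete, the sketch below shows one way the server-side step could be implemented: a FedAvg-style weighted average over the parameters of the common-space projection heads, followed by broadcasting the aggregated weights back to the clients. This is a minimal illustration under assumed PyTorch models; the names `client.model`, `client.dataset`, and `local_train_fn` are hypothetical and the paper's exact aggregation rule is not reproduced here.

```python
import copy
import torch

def aggregate_common_space(client_states, client_sizes):
    # Weighted average of the clients' common-space projection weights
    # (FedAvg-style stand-in for the server-side aggregation step).
    total = sum(client_sizes)
    avg_state = copy.deepcopy(client_states[0])
    for key in avg_state:
        avg_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    return avg_state

def federated_round(server_model, clients, local_train_fn):
    # One communication round: each client trains on its local image-text
    # pairs, uploads its common-space head, and receives the aggregate back.
    states, sizes = [], []
    for client in clients:
        client.model.load_state_dict(server_model.state_dict())
        local_train_fn(client)                      # local cross-modal training
        states.append(client.model.state_dict())
        sizes.append(len(client.dataset))
    new_state = aggregate_common_space(states, sizes)
    server_model.load_state_dict(new_state)
    return server_model
```

In this sketch only the common-space head is exchanged, so the clients' raw image and text data, and optionally their modality-specific encoders, never leave the local devices, which is the privacy motivation stated in the abstract.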
