Wasserstein Contrastive Representation Distillation

The primary goal of knowledge distillation (KD) is to encapsulate the knowledge learned by a teacher network into a student network, with the latter being more compact than the former. Existing work, e.g., distillation with the Kullback-Leibler divergence, may fail to capture important structural knowledge in the teacher network and often lacks the ability to generalize features, particularly when the teacher and student are built to address different classification tasks. We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both the primal and dual forms of the Wasserstein distance for KD. The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes a lower bound on the mutual information between the teacher and student networks. The primal form is used for local contrastive knowledge transfer within a mini-batch, effectively matching the feature distributions of the teacher and student networks. Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression, and cross-modal transfer.
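
The PyTorch sketch below illustrates, under our own assumptions, what the two ingredients described above might look like: a critic-based objective for the dual (global) term and a mini-batch Sinkhorn solver for the primal (local) term. All names (GlobalCritic, global_contrastive_loss, sinkhorn_loss), architectures, and hyperparameters (hidden, eps, n_iter) are illustrative guesses, not the paper's released implementation.

```python
# Illustrative sketch only; names and hyperparameters are assumptions,
# not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalCritic(nn.Module):
    """Critic for the dual (Kantorovich-Rubinstein) form: scores a
    (teacher feature, student feature) pair. Spectral normalization is one
    common way to encourage the Lipschitz constraint the dual form requires."""

    def __init__(self, t_dim: int, s_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(t_dim + s_dim, hidden)),
            nn.ReLU(),
            nn.utils.spectral_norm(nn.Linear(hidden, 1)),
        )

    def forward(self, f_t: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([f_t, f_s], dim=-1)).squeeze(-1)


def global_contrastive_loss(critic: GlobalCritic,
                            f_t: torch.Tensor,
                            f_s: torch.Tensor) -> torch.Tensor:
    """Dual-form (global) objective: matched teacher/student pairs should
    score higher than mismatched pairs obtained by shuffling the batch."""
    pos = critic(f_t, f_s)                                # joint samples
    neg = critic(f_t, f_s[torch.randperm(f_s.size(0))])   # product of marginals
    return -(pos.mean() - neg.mean())                     # minimize the negated bound


def sinkhorn_loss(f_t: torch.Tensor, f_s: torch.Tensor,
                  eps: float = 0.1, n_iter: int = 50) -> torch.Tensor:
    """Primal-form (local) objective: entropy-regularized optimal transport
    between student and teacher features within a mini-batch."""
    f_t, f_s = F.normalize(f_t, dim=-1), F.normalize(f_s, dim=-1)
    cost = 1.0 - f_s @ f_t.t()                            # (B, B) cosine cost matrix
    B = cost.size(0)
    log_mu = torch.full((B,), 1.0 / B, device=cost.device).log()  # uniform marginals
    log_nu = log_mu.clone()
    u = torch.zeros(B, device=cost.device)                # dual potentials
    v = torch.zeros(B, device=cost.device)
    for _ in range(n_iter):                               # log-domain Sinkhorn updates
        u = eps * (log_mu - torch.logsumexp((v[None, :] - cost) / eps, dim=1))
        v = eps * (log_nu - torch.logsumexp((u[:, None] - cost) / eps, dim=0))
    plan = torch.exp((u[:, None] + v[None, :] - cost) / eps)  # approximate transport plan
    return (plan * cost).sum()
```

In practice these two terms would presumably be combined with the usual distillation loss, e.g. `loss = ce + a * global_contrastive_loss(critic, f_t, f_s) + b * sinkhorn_loss(f_t, f_s)`, where the weights `a` and `b` are hypothetical hyperparameters.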
