Learning Diverse and Discriminative Representations via the Principle of Maximal Coding Rate Reduction

To learn intrinsic low-dimensional structures from high-dimensional data that most discriminate between classes, we propose the principle of Maximal Coding Rate Reduction ($\text{MCR}^2$), an information-theoretic objective that maximizes the difference between the coding rate of the whole dataset and the sum of the coding rates of the individual classes. We clarify its relationships with most existing frameworks, such as cross-entropy, information bottleneck, information gain, contractive learning, and contrastive learning, and provide theoretical guarantees for learning diverse and discriminative features. The coding rate can be accurately computed from finite samples of degenerate subspace-like distributions, and the principle can be used to learn intrinsic representations in supervised, self-supervised, and unsupervised settings in a unified manner. Empirically, the representations learned using this principle alone are significantly more robust to label corruptions in classification than those learned using cross-entropy, and can lead to state-of-the-art results in clustering mixed data from self-learned invariant features.
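As a rough illustration of the quantity being maximized, the sketch below computes a coding rate reduction for labeled features with NumPy. The abstract does not state the formula, so the log-determinant form used here, $R(Z,\epsilon) = \frac{1}{2}\log\det\left(I + \frac{d}{m\epsilon^2} Z Z^\top\right)$ for features $Z \in \mathbb{R}^{d \times m}$, is an assumption, as are the function names `coding_rate` and `rate_reduction`, the default `eps`, and the weighting of each class by its share of the samples. It should be read as a minimal sketch under those assumptions, not as the authors' implementation.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Assumed log-det coding rate of features Z (shape d x m) at distortion eps."""
    d, m = Z.shape
    if m == 0:
        return 0.0
    gram = np.eye(d) + (d / (m * eps ** 2)) * (Z @ Z.T)
    # slogdet is numerically safer than log(det(...)); gram is positive definite.
    _, logdet = np.linalg.slogdet(gram)
    return 0.5 * logdet


def rate_reduction(Z, labels, eps=0.5):
    """Coding rate of the whole set minus the sample-weighted sum of per-class rates."""
    labels = np.asarray(labels)
    m = Z.shape[1]
    whole = coding_rate(Z, eps)
    per_class = 0.0
    for c in np.unique(labels):
        mask = labels == c
        per_class += (mask.sum() / m) * coding_rate(Z[:, mask], eps)
    return whole - per_class


if __name__ == "__main__":
    labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    e1 = np.array([[1.0], [0.0]])
    e2 = np.array([[0.0], [1.0]])
    # Hypothetical toy features: classes on orthogonal axes vs. collapsed onto one axis.
    Z_orth = np.hstack([np.repeat(e1, 4, axis=1), np.repeat(e2, 4, axis=1)])
    Z_same = np.repeat(e1, 8, axis=1)
    print("orthogonal classes:", rate_reduction(Z_orth, labels))
    print("collapsed classes: ", rate_reduction(Z_same, labels))
```

In this toy setup, features whose classes occupy orthogonal directions yield a strictly larger rate reduction than features where both classes collapse onto the same direction, which matches the stated goal of learning representations that are discriminative between classes.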
