A unified view for unsupervised representation learning with density ratio estimation: Maximization of mutual information, nonlinear ICA and nonlinear subspace estimation

Unsupervised representation learning is one of the most important problems in machine learning. Recent promising methods are based on contrastive learning, which is often also called self-supervised learning: unsupervised representation learning is performed by solving a classification problem whose class labels are automatically generated from unlabelled data. However, contrastive learning often relies on heuristic ideas, and it is therefore not easy to understand what contrastive learning is actually doing. In this paper, we emphasize that density ratio estimation is a promising goal for unsupervised representation learning and promotes an understanding of contrastive learning. Our primary contribution is to theoretically show that density ratio estimation unifies three frameworks for unsupervised representation learning: maximization of mutual information (MI), nonlinear independent component analysis (ICA) and a novel framework, proposed in this paper, for estimating a lower-dimensional nonlinear subspace. This unified view clarifies under what conditions contrastive learning can be regarded as maximizing MI, performing nonlinear ICA or estimating the lower-dimensional nonlinear subspace in the proposed framework. Furthermore, we make theoretical contributions within each of the three frameworks: we show that MI for data representations can be maximized through density ratio estimation under certain conditions, while our analysis of nonlinear ICA reveals a novel insight into the recovery of the latent source components, which is clearly supported by numerical experiments. In addition, the proposed framework for nonlinear subspace estimation can be seen as a generalization of nonlinear ICA, and we establish theoretical conditions under which the nonlinear subspace can be estimated. The unified view through density ratio estimation is also useful in practice, because existing methods for density ratio estimation can be directly translated into practical methods for unsupervised representation learning. Following this idea, we propose two such methods: the first is an outlier-robust method for representation learning, while the second is a sample-efficient nonlinear ICA method. We then theoretically investigate the outlier-robustness of the proposed methods. Finally, we numerically demonstrate the usefulness of the proposed methods on nonlinear ICA and on a downstream linear classification task.
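To make the density-ratio view of contrastive learning concrete, the following is a minimal toy sketch (not the paper's method) of the classical link the abstract relies on: when a binary classifier is trained to discriminate samples drawn from a distribution p (label 1) against samples from q (label 0) with equal sample sizes, the optimal classifier's logit equals the log density ratio log p(x)/q(x). The Gaussian setup, the quadratic features and the use of scikit-learn are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: density ratio estimation via binary classification,
# the mechanism underlying contrastive learning. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
x_p = rng.normal(loc=1.0, scale=1.0, size=(n, 1))  # samples from p = N(1, 1)
x_q = rng.normal(loc=0.0, scale=1.0, size=(n, 1))  # samples from q = N(0, 1)

def features(x):
    # Quadratic features make the logistic model flexible enough
    # to represent the exact log ratio of two Gaussians.
    return np.hstack([x, x ** 2])

X = features(np.vstack([x_p, x_q]))
y = np.concatenate([np.ones(n), np.zeros(n)])  # auto-generated "class labels"

# Weak regularization (large C) so the fit is close to the unpenalized optimum.
clf = LogisticRegression(C=1e4).fit(X, y)

# With equal sample sizes, the classifier's logit estimates log p(x)/q(x).
x_test = np.linspace(-3.0, 4.0, 8).reshape(-1, 1)
log_ratio_hat = clf.decision_function(features(x_test))

# Ground truth for these two unit-variance Gaussians: log p/q = x - 0.5.
log_ratio_true = x_test.ravel() - 0.5
print(np.round(log_ratio_hat, 2))
print(np.round(log_ratio_true, 2))
```

The learned logit closely tracks the analytic log ratio x - 0.5, illustrating how solving a classification problem with automatically generated labels implicitly performs density ratio estimation.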
