Likelihood Ratio Exponential Families

The exponential family is well known in machine learning and statistical physics as the maximum entropy distribution subject to a set of observed constraints [1], while the geometric mixture path is common in MCMC methods such as annealed importance sampling (AIS) [2, 3]. Linking these two ideas, recent work [4] has interpreted the geometric mixture path as an exponential family of distributions to analyse the thermodynamic variational objective (TVO) [5]. We extend these likelihood ratio exponential families to include solutions to rate-distortion (RD) optimization [6, 7], the Information Bottleneck (IB) method [8], and recent rate-distortion-classification (RDC) approaches which combine RD and IB [9, 10]. This provides a common mathematical framework for understanding these methods via the conjugate duality of exponential families and hypothesis testing. Further, we collect existing results [11–14] to provide a variational representation of the intermediate RD or TVO distributions as minimizing an expectation of KL divergences. This solution also corresponds to a size-power tradeoff in hypothesis testing via the likelihood ratio test and the Neyman-Pearson lemma. For thermodynamic integration (TI) bounds [15, 16] such as the TVO, we identify the intermediate distribution whose expected sufficient statistics match the log partition function.
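
As a concrete illustration of the objects described above, the following is a minimal sketch in notation introduced here for exposition (base density $\pi_0$, target density $\pi_1$, mixing parameter $\beta \in [0,1]$, normalizer $Z(\beta)$; these symbols are our own and not drawn from any single cited work). The geometric mixture path forms a one-dimensional exponential family whose sufficient statistic is the log likelihood ratio:
\[
\pi_\beta(z) \;=\; \frac{1}{Z(\beta)}\, \pi_0(z)\, e^{\beta\, T(z)},
\qquad
T(z) \;=\; \log \frac{\pi_1(z)}{\pi_0(z)},
\qquad
Z(\beta) \;=\; \int \pi_0(z)^{1-\beta}\, \pi_1(z)^{\beta}\, dz .
\]
Each intermediate distribution then admits the variational representation mentioned above, as the minimizer of a $\beta$-weighted combination of KL divergences,
\[
\pi_\beta \;=\; \operatorname*{arg\,min}_{r}\;\; (1-\beta)\, D_{\mathrm{KL}}(r \,\|\, \pi_0) \;+\; \beta\, D_{\mathrm{KL}}(r \,\|\, \pi_1),
\]
and differentiating the log partition function gives the thermodynamic integration identity underlying TVO-style bounds,
\[
\frac{d}{d\beta} \log Z(\beta) \;=\; \mathbb{E}_{\pi_\beta}\!\left[\, T(z) \,\right]
\qquad\Longrightarrow\qquad
\log \frac{Z(1)}{Z(0)} \;=\; \int_0^1 \mathbb{E}_{\pi_\beta}\!\left[\, T(z) \,\right] d\beta .
\]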

[1] Alexander A. Alemi, et al. TherML: Thermodynamics of Machine Learning, 2018, ArXiv.

[2] Frank D. Wood, et al. All in the Exponential Family: Bregman Duality in Thermodynamic Variational Inference, 2020, ICML.

[3] Ruslan Salakhutdinov, et al. Annealing between distributions by averaging moments, 2013, NIPS.

[4] Stefano Soatto, et al. Information Dropout: Learning Optimal Representations Through Noisy Computation, 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5] Frank D. Wood, et al. The Thermodynamic Variational Objective, 2019, NeurIPS.

[6] Yansong Gao, et al. A Free-Energy Principle for Representation Learning, 2020, ICML.

[7] Polina Golland, et al. DEMI: Discriminative Estimator of Mutual Information, 2020, ArXiv.

[8] Michael I. Jordan, et al. Graphical Models, Exponential Families, and Variational Inference, 2008, Found. Trends Mach. Learn.

[9] Nir Friedman, et al. The Information Bottleneck EM Algorithm, 2002, UAI.

[10] Lizhong Zheng, et al. I-Projection and the Geometry of Error Exponents, 2006.

[11] Naftali Tishby, et al. The information bottleneck method, 2000, ArXiv.

[12] J. Borwein, et al. Convex Analysis and Nonlinear Optimization, 2000.

[13] Geoffrey C. Fox, et al. A deterministic annealing approach to clustering, 1990, Pattern Recognit. Lett.

[14] Frank Nielsen, et al. On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means, 2019, Entropy.

[15] Xiao-Li Meng, et al. Simulating Normalizing Constants: From Importance Sampling to Bridge Sampling to Path Sampling, 1998.

[16] Radford M. Neal. Annealed importance sampling, 1998, Stat. Comput.

[17] E. Jaynes. Information Theory and Statistical Mechanics, 1957.

[18] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[19] Frank Nielsen, et al. A family of statistical symmetric divergences based on Jensen's inequality, 2010, ArXiv.

[20] Makoto Yamada, et al. Neural Methods for Point-wise Dependency Estimation, 2020, NeurIPS.

[21] Y. Ogata. A Monte Carlo method for high dimensional integration, 1989.

[22] Alireza Makhzani, et al. Evaluating Lossy Compression Rates of Deep Generative Models, 2020, ICML.

[23] Inderjit S. Dhillon, et al. Clustering with Bregman Divergences, 2005, J. Mach. Learn. Res.

[24] C. Jarzynski. Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach, 1997, cond-mat/9707325.

[25] Alexander A. Alemi, et al. On Variational Bounds of Mutual Information, 2019, ICML.

[26] Jorma Rissanen. Minimum Description Length Principle, 2010, Encyclopedia of Machine Learning.

[27] Naftali Tishby, et al. Multivariate Information Bottleneck, 2001, Neural Computation.

[28] Jacob Deasy, et al. Constraining Variational Inference with Geometric Jensen-Shannon Divergence, 2020, NeurIPS.

[29] Peter Harremoës, et al. Rényi Divergence and Kullback-Leibler Divergence, 2012, IEEE Transactions on Information Theory.

[30] Kai Xu, et al. Telescoping Density-Ratio Estimation, 2020, NeurIPS.

[31] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems, 1998, Proc. IEEE.

[32] Alexander A. Alemi, et al. Fixing a Broken ELBO, 2017, ICML.

[33] Peter Harremoës. Interpretations of Rényi Entropies and Divergences, 2005.

[34] Frank Nielsen, et al. An Information-Geometric Characterization of Chernoff Information, 2013, IEEE Signal Processing Letters.

[35] Alexander A. Alemi, et al. Deep Variational Information Bottleneck, 2017, ICLR.

[36] Imre Csiszár, et al. Information Theory and Statistics: A Tutorial, 2004, Found. Trends Commun. Inf. Theory.

[37] I. Csiszár. The Method of Types, 1998, IEEE Transactions on Information Theory.

[38] Frank Nielsen, et al. A closed-form expression for the Sharma–Mittal entropy of exponential families, 2011, ArXiv.