Likelihood Ratio Exponential Families

The exponential family is well known in machine learning and statistical physics as the maximum entropy distribution subject to a set of observed constraints [1], while the geometric mixture path is common in MCMC methods such as annealed importance sampling (AIS) [2, 3]. Linking these two ideas, recent work [4] has interpreted the geometric mixture path as an exponential family of distributions to analyse the thermodynamic variational objective (TVO) [5]. We extend these likelihood ratio exponential families to include solutions to rate-distortion (RD) optimization [6, 7], the Information Bottleneck (IB) method [8], and recent rate-distortion-classification (RDC) approaches which combine RD and IB [9, 10]. This provides a common mathematical framework for understanding these methods via the conjugate duality of exponential families and hypothesis testing. Further, we collect existing results [11–14] to provide a variational representation of the intermediate RD or TVO distributions as minimizing an expectation of KL divergences. This solution also corresponds to a size-power tradeoff in hypothesis testing via the likelihood ratio test and the Neyman-Pearson lemma. For thermodynamic integration (TI) bounds [15, 16] such as the TVO, we identify the intermediate distribution whose expected sufficient statistics match the log partition function.
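
As a concrete illustration of the objects described above, the following is a minimal sketch in notation introduced here for exposition (base density $\pi_0$, target density $\pi_1$, mixing parameter $\beta \in [0,1]$, normalizer $Z(\beta)$; these symbols are our own and not drawn from any single cited work). The geometric mixture path forms a one-dimensional exponential family whose sufficient statistic is the log likelihood ratio:
\[
\pi_\beta(z) \;=\; \frac{1}{Z(\beta)}\, \pi_0(z)\, e^{\beta\, T(z)},
\qquad
T(z) \;=\; \log \frac{\pi_1(z)}{\pi_0(z)},
\qquad
Z(\beta) \;=\; \int \pi_0(z)^{1-\beta}\, \pi_1(z)^{\beta}\, dz .
\]
Each intermediate distribution then admits the variational representation mentioned above, as the minimizer of a $\beta$-weighted combination of KL divergences,
\[
\pi_\beta \;=\; \operatorname*{arg\,min}_{r}\;\; (1-\beta)\, D_{\mathrm{KL}}(r \,\|\, \pi_0) \;+\; \beta\, D_{\mathrm{KL}}(r \,\|\, \pi_1),
\]
and differentiating the log partition function gives the thermodynamic integration identity underlying TVO-style bounds,
\[
\frac{d}{d\beta} \log Z(\beta) \;=\; \mathbb{E}_{\pi_\beta}\!\left[\, T(z) \,\right]
\qquad\Longrightarrow\qquad
\log \frac{Z(1)}{Z(0)} \;=\; \int_0^1 \mathbb{E}_{\pi_\beta}\!\left[\, T(z) \,\right] d\beta .
\]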

[1] Alexander A. Alemi, et al. TherML: Thermodynamics of Machine Learning, 2018, ArXiv.

[2] Frank D. Wood, et al. All in the Exponential Family: Bregman Duality in Thermodynamic Variational Inference, 2020, ICML.

[3] Ruslan Salakhutdinov, et al. Annealing between distributions by averaging moments, 2013, NIPS.

[4] Stefano Soatto, et al. Information Dropout: Learning Optimal Representations Through Noisy Computation, 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5] Frank D. Wood, et al. The Thermodynamic Variational Objective, 2019, NeurIPS.

[6] Yansong Gao, et al. A Free-Energy Principle for Representation Learning, 2020, ICML.

[7] Polina Golland, et al. DEMI: Discriminative Estimator of Mutual Information, 2020, ArXiv.

[8] Michael I. Jordan, et al. Graphical Models, Exponential Families, and Variational Inference, 2008, Found. Trends Mach. Learn.

[9] Nir Friedman, et al. The Information Bottleneck EM Algorithm, 2002, UAI.

[10] Lizhong Zheng, et al. I-Projection and the Geometry of Error Exponents, 2006.

[11] Naftali Tishby, et al. The information bottleneck method, 2000, ArXiv.

[12] J. Borwein, et al. Convex Analysis and Nonlinear Optimization, 2000.

[13] Geoffrey C. Fox, et al. A deterministic annealing approach to clustering, 1990, Pattern Recognit. Lett.

[14] Frank Nielsen, et al. On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means, 2019, Entropy.

[15] Xiao-Li Meng, et al. Simulating Normalizing Constants: From Importance Sampling to Bridge Sampling to Path Sampling, 1998.

[16] Radford M. Neal. Annealed importance sampling, 1998, Stat. Comput.

[17] E. Jaynes. Information Theory and Statistical Mechanics, 1957.

[18] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[19] Frank Nielsen, et al. A family of statistical symmetric divergences based on Jensen's inequality, 2010, ArXiv.

[20] Makoto Yamada, et al. Neural Methods for Point-wise Dependency Estimation, 2020, NeurIPS.

[21] Y. Ogata. A Monte Carlo method for high dimensional integration, 1989.

[22] Alireza Makhzani, et al. Evaluating Lossy Compression Rates of Deep Generative Models, 2020, ICML.

[23] Inderjit S. Dhillon, et al. Clustering with Bregman Divergences, 2005, J. Mach. Learn. Res.

[24] C. Jarzynski. Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach, 1997, cond-mat/9707325.

[25] Alexander A. Alemi, et al. On Variational Bounds of Mutual Information, 2019, ICML.

[26] Jorma Rissanen. Minimum Description Length Principle, 2010, Encyclopedia of Machine Learning.

[27] Naftali Tishby, et al. Multivariate Information Bottleneck, 2001, Neural Computation.

[28] Jacob Deasy, et al. Constraining Variational Inference with Geometric Jensen-Shannon Divergence, 2020, NeurIPS.

[29] Peter Harremoës, et al. Rényi Divergence and Kullback-Leibler Divergence, 2012, IEEE Transactions on Information Theory.

[30] Kai Xu, et al. Telescoping Density-Ratio Estimation, 2020, NeurIPS.

[31] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems, 1998, Proc. IEEE.

[32] Alexander A. Alemi, et al. Fixing a Broken ELBO, 2017, ICML.

[33] Peter Harremoës. Interpretations of Rényi Entropies and Divergences, 2005.

[34] Frank Nielsen, et al. An Information-Geometric Characterization of Chernoff Information, 2013, IEEE Signal Processing Letters.

[35] Alexander A. Alemi, et al. Deep Variational Information Bottleneck, 2017, ICLR.

[36] Imre Csiszár, et al. Information Theory and Statistics: A Tutorial, 2004, Found. Trends Commun. Inf. Theory.

[37] I. Csiszár. The Method of Types, 1998, IEEE Transactions on Information Theory.

[38] Frank Nielsen, et al. A closed-form expression for the Sharma–Mittal entropy of exponential families, 2011, ArXiv.