Learning the Information Divergence

Information divergences, which measure the difference between two nonnegative matrices or tensors, are used in a variety of machine learning problems. Examples include Nonnegative Matrix/Tensor Factorization, Stochastic Neighbor Embedding, topic models, and Bayesian network optimization. The success of such a learning task depends heavily on choosing a suitable divergence. A large variety of divergences have been suggested and analyzed, but very few results are available for objectively choosing the optimal divergence for a given task. Here we present a framework that facilitates automatic selection of the best divergence within a given family, based on standard maximum likelihood estimation. We first propose an approximated Tweedie distribution for the β-divergence family. Selecting the best β then becomes a machine learning problem solved by maximum likelihood. Next, we reformulate α-divergence in terms of β-divergence, which enables automatic selection of α by maximum likelihood, reusing the learning principle for β-divergence. Furthermore, we show the connections between γ- and β-divergences as well as between Rényi- and α-divergences, so that our automatic selection framework extends to non-separable divergences. Experiments on both synthetic and real-world data demonstrate that our method can quite accurately select the information divergence across different learning problems and various divergence families.
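To make the role of β concrete, the following is a minimal sketch of the β-divergence between two strictly positive arrays. It assumes the parameterization in which β = 1 gives half the squared Euclidean distance, β → 0 gives the generalized Kullback-Leibler divergence, and β → -1 gives the Itakura-Saito divergence; other works shift β by one, and the abstract itself does not state which convention is used. The function name `beta_divergence` is illustrative only, and the sketch covers just the divergence, not the maximum likelihood selection procedure.

```python
import numpy as np


def beta_divergence(x, mu, beta):
    """Beta-divergence D_beta(x || mu), summed over all entries.

    Assumed convention (not stated in the abstract):
      beta = 1   -> 0.5 * squared Euclidean distance
      beta -> 0  -> generalized Kullback-Leibler divergence
      beta -> -1 -> Itakura-Saito divergence
    Both inputs must be strictly positive for the limiting cases.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)

    if np.isclose(beta, 0.0):
        # Limit beta -> 0: generalized KL (I-divergence).
        return float(np.sum(x * np.log(x / mu) - x + mu))
    if np.isclose(beta, -1.0):
        # Limit beta -> -1: Itakura-Saito divergence.
        ratio = x / mu
        return float(np.sum(ratio - np.log(ratio) - 1.0))

    # General case, valid for beta not in {0, -1}.
    numerator = x ** (beta + 1) + beta * mu ** (beta + 1) - (beta + 1) * x * mu ** beta
    return float(np.sum(numerator) / (beta * (beta + 1)))


if __name__ == "__main__":
    x = np.array([1.0, 2.0, 3.0])
    mu = np.array([1.5, 1.5, 2.5])
    for b in (-1.0, 0.0, 0.5, 1.0):
        print(b, beta_divergence(x, mu, b))
```

Different values of β weight large and small entries differently, which is why the choice of β matters for a given dataset and why selecting it automatically is useful.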
