Divergences and Risks for Multiclass Experiments
25th Annual Conference on Learning Theory

Csiszár's f-divergence is a way to measure the similarity of two probability distributions. We study the extension of f-divergence to more than two distributions to measure their joint similarity. By exploiting classical results from the comparison of experiments literature we prove the resulting divergence satisfies all the same properties as the traditional binary one. Considering the multidistribution case actually makes the proofs simpler. The key to these results is a formal bridge between these multidistribution f-divergences and Bayes risks for multiclass classification problems.
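
As a concrete, hedged illustration of the bridge described above (the function names, the toy data, and the particular choice of f below are ours, not the paper's), the sketch computes a multi-distribution f-divergence of the form I_f(P_1, ..., P_k) = sum_x f(p_1(x), ..., p_k(x)) for discrete distributions, alongside the 0-1 Bayes risk of the k-class problem they induce. Choosing f(u) = sum_i pi_i u_i - max_i pi_i u_i (a concave f, following the risk-side convention; negating it gives the convex-f divergence form) makes the two quantities coincide, which is the simplest instance of the divergence/risk correspondence.

```python
import numpy as np

def f_dissimilarity(dists, f):
    """Multi-distribution f-divergence I_f(P_1,...,P_k) for k discrete
    distributions given as the rows of `dists` (shape k x m):
    sum over the alphabet of f applied to the vector (p_1(x),...,p_k(x))."""
    return sum(f(dists[:, x]) for x in range(dists.shape[1]))

def bayes_risk_01(dists, priors):
    """0-1 Bayes risk of the k-class problem with class-conditionals `dists`
    (rows) and class priors `priors`: 1 - sum_x max_i pi_i p_i(x)."""
    joint = priors[:, None] * dists            # entries pi_i * p_i(x)
    return 1.0 - joint.max(axis=0).sum()

# Toy example: three class-conditional distributions over a 4-letter alphabet.
P = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.6, 0.2, 0.1],
              [0.1, 0.1, 0.2, 0.6]])
pi = np.array([1/3, 1/3, 1/3])

# With f(u) = sum_i pi_i u_i - max_i pi_i u_i, I_f equals the 0-1 Bayes risk.
f = lambda u: np.dot(pi, u) - np.max(pi * u)

print(f_dissimilarity(P, f))   # multi-distribution f-divergence
print(bayes_risk_01(P, pi))    # same value: the multiclass Bayes risk
```

For this toy data both quantities evaluate to 0.3, i.e. the best possible classifier errs 30% of the time; richer choices of f recover other joint similarity measures in the same way.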
