Metric Entropy and Minimax Risk in Classification

We apply recent results on the minimax risk in density estimation to the related problem of pattern classification. The notion of loss we seek to minimize is an information-theoretic measure of how well we can predict the classification of future examples, given the classification of previously seen examples. We give an asymptotic characterization of the minimax risk in terms of the metric entropy properties of the class of distributions that might be generating the examples. We then use these results to characterize the minimax risk in the special case of noisy two-valued classification problems in terms of the Assouad density and the Vapnik-Chervonenkis dimension.
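Schematically, the quantity at stake is the cumulative relative-entropy (log-loss) risk of sequentially predicting labels. The display below is an illustrative sketch in the spirit of the Haussler-Opper line of work, not the paper's exact theorem; the symbols and scaling exponents are stated informally.

```latex
% Cumulative log-loss (relative-entropy) minimax risk: after seeing the
% labeled examples Z_1^{t-1}, predict the label Y_t of instance X_t.
% Notation here is illustrative, not the paper's exact statement.
R_n(\mathcal{P}) \;=\; \inf_{\hat{P}} \,\sup_{P \in \mathcal{P}}
  \sum_{t=1}^{n} \mathbb{E}\, D\!\left( P(\,\cdot \mid X_t) \;\middle\|\;
  \hat{P}(\,\cdot \mid X_t, Z_1^{t-1}) \right),
\qquad Z_i = (X_i, Y_i).

% Schematic form of the metric-entropy characterization: if the metric
% entropy of \mathcal{P} scales polynomially, H(\varepsilon) \asymp
% \varepsilon^{-r}, the cumulative risk grows polynomially,
% R_n \asymp n^{r/(r+2)}; for classes with finite Vapnik-Chervonenkis
% dimension d, the growth is only logarithmic, R_n = O(d \log n),
% i.e. a per-prediction risk on the order of d/n.
```

The point of the characterization is that the geometry of the class (its metric entropy, or its Assouad/VC combinatorial dimensions in the two-valued case) controls the minimax prediction risk on both sides.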