Minimax-Optimal Bounds for Detectors Based on Estimated Prior Probabilities

In many signal detection and classification problems, we have knowledge of the distribution under each hypothesis, but not the prior probabilities. This paper provides theory to quantify the performance of detectors that estimate the prior probabilities from either labeled or unlabeled training data. The error, or risk, is considered as a function of the prior probabilities. We show that the risk function is locally Lipschitz in the vicinity of the true prior probabilities, and that the error of detectors based on estimated prior probabilities depends on the behavior of the risk function in this neighborhood. In general, we show that the error of detectors based on the maximum likelihood estimate (MLE) of the prior probabilities converges to the Bayes error at a rate of $n^{-1/2}$, where $n$ is the number of training samples. If the behavior of the risk function is more favorable, then detectors based on the MLE have errors converging to the corresponding Bayes errors at optimal rates of the form $n^{-(1+\alpha)/2}$, where $\alpha > 0$ is a parameter governing the behavior of the risk function, with a typical value of $\alpha = 1$. The limit $\alpha \to \infty$ corresponds to a situation in which the risk function is flat near the true prior probabilities and thus insensitive to small errors in the MLE; in this case, the error of the detector based on the MLE converges to the Bayes error exponentially fast in $n$. We show that these bounds are achievable with either labeled or unlabeled training data and are minimax-optimal in the labeled case.
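As a concrete illustration of the plug-in construction described above, the following is a minimal sketch, not taken from the paper, assuming two known Gaussian class-conditional densities and labeled training data. Under that assumption the MLE of the prior is simply the empirical class frequency, and the plug-in detector substitutes this estimate into the Bayes (MAP) rule; all names and parameter values below are hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Known class-conditional densities (illustrative choice): H0 ~ N(0,1), H1 ~ N(1,1).
f0 = norm(loc=0.0, scale=1.0)
f1 = norm(loc=1.0, scale=1.0)

true_pi1 = 0.3  # true prior probability of H1, unknown to the detector

# Labeled training data: the MLE of the prior is the empirical class frequency.
n = 500
labels = rng.binomial(1, true_pi1, size=n)
pi1_hat = labels.mean()

def map_detect(x, pi1):
    """MAP rule with prior pi1: decide H1 when pi1*f1(x) >= (1 - pi1)*f0(x)."""
    return (pi1 * f1.pdf(x) >= (1.0 - pi1) * f0.pdf(x)).astype(int)

# Compare the plug-in detector (estimated prior) with the Bayes detector (true prior).
m = 200_000
test_labels = rng.binomial(1, true_pi1, size=m)
test_x = np.where(test_labels == 1,
                  f1.rvs(m, random_state=rng),
                  f0.rvs(m, random_state=rng))

err_plug_in = np.mean(map_detect(test_x, pi1_hat) != test_labels)
err_bayes = np.mean(map_detect(test_x, true_pi1) != test_labels)
print(f"plug-in error: {err_plug_in:.4f}, Bayes error: {err_bayes:.4f}")
```

In this sketch, increasing $n$ shrinks the gap between the plug-in error and the Bayes error; the abstract's results quantify how fast that gap closes depending on the local behavior of the risk function near the true priors.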
