On threshold-based classification rules

Abstract. Suppose we have n i.i.d. copies {(X_i, Y_i), i = 1, …, n} of an example (X, Y), where X ∈ 𝒳 is an instance and Y ∈ {−1, 1} is a label. A decision function (or classifier) f is a map f : 𝒳 → [−1, 1]. Based on f, the example (X, Y) is misclassified if Y f(X) ≤ 0. In this paper, we first study the case 𝒳 = ℝ and the simple decision functions h_a(x) = 2·1{x ≥ a} − 1 based on a threshold a ∈ ℝ. We choose the threshold â_n that minimizes the classification error in the sample and derive its asymptotic distribution. We also show that, under monotonicity assumptions, â_n is a nonparametric maximum likelihood estimator. Next, we consider more complicated classification rules based on averaging over a class of base classifiers. We allow certain examples to remain unclassified due to lack of evidence, and we provide a uniform bound for the margin. Moreover, we illustrate that, when using averaged classification rules, maximizing the number of examples with margin above a given value can overcome the problem of overfitting. In our illustration, the classification problem then boils down to optimizing over certain threshold-based classifiers.
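The empirical threshold estimator â_n described above can be illustrated with a short sketch: given a real-valued sample and labels in {−1, 1}, scan candidate thresholds and keep the one with the smallest in-sample misclassification rate. This is a minimal illustration under stated assumptions, not code from the paper; the function name empirical_threshold and the synthetic data are made up for the example.

```python
# A minimal sketch of the empirical-error-minimizing threshold a_hat_n
# for the classifier h_a(x) = 2*1{x >= a} - 1, with labels y in {-1, +1}.
# Illustration only; names and data are assumptions, not from the paper.
import numpy as np

def empirical_threshold(x, y):
    """Return the threshold a minimizing the in-sample classification error.

    x : 1-d array of real-valued instances
    y : 1-d array of labels in {-1, +1}
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    # Candidate thresholds: the observed points, plus one value below and one
    # above all of them, so the all-(+1) and all-(-1) rules are also considered.
    candidates = np.concatenate(([x.min() - 1.0], np.sort(x), [x.max() + 1.0]))
    best_a, best_err = candidates[0], np.inf
    for a in candidates:
        pred = np.where(x >= a, 1, -1)   # h_a(x) = 2*1{x >= a} - 1
        err = np.mean(pred != y)         # empirical misclassification rate
        if err < best_err:
            best_a, best_err = a, err
    return best_a, best_err

# Usage: a noisy synthetic sample where larger x tends to mean label +1.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.where(x + 0.3 * rng.normal(size=200) >= 0, 1, -1)
a_hat, err = empirical_threshold(x, y)
print(f"a_hat_n = {a_hat:.3f}, empirical error = {err:.3f}")
```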
