On threshold-based classification rules

Abstract. Suppose we have n i.i.d. copies {(X_i, Y_i), i = 1, …, n} of an example (X, Y), where X ∈ 𝒳 is an instance and Y ∈ {−1, 1} is a label. A decision function (or classifier) f is a map f : 𝒳 → [−1, 1]. Based on f, the example (X, Y) is misclassified if Y f(X) ≤ 0. In this paper, we first study the case 𝒳 = ℝ and the simple decision functions h_a(x) = 2·1{x ≥ a} − 1 based on a threshold a ∈ ℝ. We choose the threshold â_n that minimizes the classification error in the sample and derive its asymptotic distribution. We also show that, under monotonicity assumptions, â_n is a nonparametric maximum likelihood estimator. Next, we consider more complicated classification rules based on averaging over a class of base classifiers. We allow certain examples to remain unclassified due to lack of evidence, and we provide a uniform bound for the margin. Moreover, we illustrate that, when using averaged classification rules, maximizing the number of examples with margin above a given value can overcome the problem of overfitting. In our illustration, the classification problem then boils down to optimizing over certain threshold-based classifiers.
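The empirical threshold estimator â_n described above can be illustrated with a short sketch: given a real-valued sample and labels in {−1, 1}, scan candidate thresholds and keep the one with the smallest in-sample misclassification rate. This is a minimal illustration under stated assumptions, not code from the paper; the function name empirical_threshold and the synthetic data are made up for the example.

```python
# A minimal sketch of the empirical-error-minimizing threshold a_hat_n
# for the classifier h_a(x) = 2*1{x >= a} - 1, with labels y in {-1, +1}.
# Illustration only; names and data are assumptions, not from the paper.
import numpy as np

def empirical_threshold(x, y):
    """Return the threshold a minimizing the in-sample classification error.

    x : 1-d array of real-valued instances
    y : 1-d array of labels in {-1, +1}
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    # Candidate thresholds: the observed points, plus one value below and one
    # above all of them, so the all-(+1) and all-(-1) rules are also considered.
    candidates = np.concatenate(([x.min() - 1.0], np.sort(x), [x.max() + 1.0]))
    best_a, best_err = candidates[0], np.inf
    for a in candidates:
        pred = np.where(x >= a, 1, -1)   # h_a(x) = 2*1{x >= a} - 1
        err = np.mean(pred != y)         # empirical misclassification rate
        if err < best_err:
            best_a, best_err = a, err
    return best_a, best_err

# Usage: a noisy synthetic sample where larger x tends to mean label +1.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.where(x + 0.3 * rng.normal(size=200) >= 0, 1, -1)
a_hat, err = empirical_threshold(x, y)
print(f"a_hat_n = {a_hat:.3f}, empirical error = {err:.3f}")
```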
