On learning multicategory classification with sample queries

Consider the pattern-recognition problem of learning multicategory classification from a labeled sample, for instance, learning character recognition, where each category corresponds to an alphanumeric character. The classical theory of pattern recognition assumes that labeled examples are drawn according to the unknown class-conditional probability distributions, with the pattern classes themselves picked randomly according to their a priori probabilities. In this paper we pose the following question: can learning accuracy be improved if labeled examples are still drawn independently from the underlying class-conditional distributions, but the pattern classes are chosen in proportions that need not match their a priori probabilities? We answer this in the affirmative by showing that there exists a tuning of the sub-sample proportions which minimizes a loss criterion. The optimal tuning depends on the intrinsic complexity of the Bayes classifier. Since this complexity in turn depends on the underlying probability distributions, which are assumed unknown, we provide an algorithm that learns the proportions on-line via sample querying and asymptotically minimizes the criterion. In practice, this algorithm may be used to boost the performance of existing classification-learning algorithms by apportioning the training sample among the pattern classes more effectively.
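The abstract does not spell out the loss criterion or the update rule, so the following is only a minimal Python sketch of the general kind of scheme described: the learner repeatedly issues a sample query to a class chosen by the current proportions, refits a classifier, and shifts the proportions toward classes whose estimated conditional error is larger. The callables `query_example`, `fit_classifier`, and `estimate_class_error`, and the error-proportional update itself, are hypothetical placeholders, not the paper's algorithm.

```python
import numpy as np

def tune_proportions(query_example, fit_classifier, estimate_class_error,
                     num_classes, num_rounds, eta=0.1, seed=0):
    """Hedged sketch of on-line sub-sample proportion tuning via sample queries.

    query_example(c)           draws one labeled example from the class-c
                               conditional distribution (a sample query)
    fit_classifier(sample)     trains a classifier on the sample gathered so far
    estimate_class_error(f, c) estimates classifier f's error on class c
    """
    rng = np.random.default_rng(seed)
    proportions = np.full(num_classes, 1.0 / num_classes)  # start uniform
    sample, f = [], None
    for _ in range(num_rounds):
        # Query one labeled example from a class drawn by the current proportions.
        c = rng.choice(num_classes, p=proportions)
        sample.append(query_example(c))
        f = fit_classifier(sample)
        # Heuristic update (an assumption, not the paper's rule): move the
        # proportions toward classes with larger estimated conditional error.
        errs = np.array([estimate_class_error(f, k) for k in range(num_classes)])
        if errs.sum() > 0:
            proportions = (1 - eta) * proportions + eta * errs / errs.sum()
    return proportions, f
```

The step size `eta` trades off how quickly the proportions react to the error estimates against the noise in those estimates; the convex-combination update keeps `proportions` a valid probability vector at every round.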
