Linear Classification and Selective Sampling Under Low Noise Conditions

We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate N^{-(1+α)(2+α)/(2(3+α))}, where N denotes the number of queried labels and α > 0 is the exponent in the low noise condition. For all α > √3 − 1 ≈ 0.73 this convergence rate is asymptotically faster than the rate N^{-(1+α)/(2+α)} achieved by the fully supervised version of the same classifier, which queries all labels, and for α → ∞ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors.
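As a quick check of the stated threshold (our own algebra, implied by the abstract but not spelled out in it), the crossover value √3 − 1 is exactly the point where the selective-sampling exponent of N overtakes the fully supervised one. Cancelling the common positive factor (1+α) and cross-multiplying the positive denominators gives a quadratic condition in α:

\[
\frac{(1+\alpha)(2+\alpha)}{2(3+\alpha)} \;>\; \frac{1+\alpha}{2+\alpha}
\;\Longleftrightarrow\;
(2+\alpha)^2 > 2(3+\alpha)
\;\Longleftrightarrow\;
\alpha^2 + 2\alpha - 2 > 0
\;\Longleftrightarrow\;
\alpha > \sqrt{3} - 1 \approx 0.73 .
\]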
