Learning noisy linear classifiers via adaptive and selective sampling

We introduce efficient margin-based algorithms for selective sampling and filtering in binary classification tasks. Experiments on real-world textual data reveal that our algorithms perform significantly better than popular and similarly efficient competitors. Using the so-called Mammen-Tsybakov low noise condition to parametrize the instance distribution, and assuming linear label noise, we bound the convergence rate to the Bayes risk of a weaker adaptive variant of our selective sampler. Our analysis reveals that, excluding logarithmic factors, the average risk of this adaptive sampler converges to the Bayes risk at rate $N^{-(1+\alpha)(2+\alpha)/(2(3+\alpha))}$, where $N$ denotes the number of queried labels and $\alpha > 0$ is the exponent in the low noise condition. For all $\alpha > \sqrt{3} - 1 \approx 0.73$, this convergence rate is asymptotically faster than the rate $N^{-(1+\alpha)/(2+\alpha)}$ achieved by the fully supervised version of the base selective sampler, which queries all labels. Moreover, for $\alpha \to \infty$ (the hard margin condition), the gap between the semi- and fully supervised rates becomes exponential.
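
To see where the threshold $\sqrt{3} - 1$ comes from, compare the two exponents directly. Since $1 + \alpha > 0$, the semi-supervised exponent dominates exactly when

$$
\frac{(1+\alpha)(2+\alpha)}{2(3+\alpha)} > \frac{1+\alpha}{2+\alpha}
\iff (2+\alpha)^2 > 2(3+\alpha)
\iff \alpha^2 + 2\alpha - 2 > 0
\iff \alpha > \sqrt{3} - 1,
$$

where the last step takes the positive root of $\alpha^2 + 2\alpha - 2 = 0$.

The abstract names only the algorithmic template, so the following is a minimal illustrative sketch of a generic margin-based selective sampler in the spirit described above, not the paper's actual algorithm: the regularized least-squares predictor, the function names, and in particular the $\sqrt{\log t / n}$ threshold schedule are assumptions made for this example.

```python
import numpy as np

def selective_sampling_run(X, label_oracle, reg=1.0):
    """Generic margin-based selective sampler (illustrative sketch only).

    Maintains a regularized least-squares predictor over the queried
    examples and asks for a label only when the current margin is small
    relative to a shrinking confidence threshold.
    """
    d = X.shape[1]
    A = reg * np.eye(d)      # regularized correlation matrix of queried instances
    b = np.zeros(d)          # label-weighted sum of queried instances
    n_queried = 0
    predictions = []
    for t, x in enumerate(X, start=1):
        w = np.linalg.solve(A, b)            # current RLS weight vector
        margin = float(w @ x)
        predictions.append(1.0 if margin >= 0 else -1.0)
        # Query when the margin is too small to be trusted. This threshold
        # schedule is a placeholder assumption, not the paper's rule.
        if abs(margin) <= np.sqrt(np.log(t + 1) / (n_queried + 1)):
            y = label_oracle(x)              # costly label in {-1, +1}
            A += np.outer(x, x)
            b += y * x
            n_queried += 1
    return predictions, n_queried

# Tiny demo with noiseless linear labels on random instances.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X = rng.normal(size=(1000, 5))
preds, n = selective_sampling_run(X, lambda x: 1.0 if w_true @ x >= 0 else -1.0)
print(f"queried {n} of {len(X)} labels")
```

Note that measuring risk against the number of queried labels $N$, rather than the number of observed instances, is what makes the rate comparison above meaningful: a sampler of this kind spends its label budget on instances near the decision boundary.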
