Activized Learning: Transforming Passive to Active with Improved Label Complexity

We study the theoretical advantages of active learning over passive learning. Specifically, we prove that, in noise-free classifier learning for VC classes, any passive learning algorithm can be transformed into an active learning algorithm with asymptotically strictly superior label complexity for all nontrivial target functions and distributions. We further provide a general characterization of the magnitudes of these improvements in terms of a novel generalization of the disagreement coefficient. We also extend these results to active learning in the presence of label noise, and find that even under broad classes of noise distributions, we can typically guarantee strict improvements over the known results for passive learning.
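To make the passive-to-active transformation concrete, here is a minimal sketch of disagreement-based selective sampling (in the spirit of CAL) for the simple class of threshold classifiers h_t(x) = 1[x > t] on [0, 1] in the noise-free setting. This is an illustration of the general idea behind activized learning, not the paper's actual activizer; the target threshold, stream size, and seed are arbitrary assumptions for the demo. Labels are queried only inside the disagreement region of the current version space, which for thresholds shrinks geometrically, yielding the kind of strict label-complexity improvement over passive learning that the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)
target_t = 0.37                       # unknown target threshold (hypothetical)
stream = rng.uniform(0.0, 1.0, 1000)  # unlabeled i.i.d. data stream

# Version space: every threshold t in (lo, hi) is consistent with all
# labels queried so far.
lo, hi = 0.0, 1.0
queries = 0

for x in stream:
    # Consistent thresholds disagree on x exactly when lo < x < hi,
    # so (lo, hi) is the disagreement region of the version space.
    if lo < x < hi:
        y = int(x > target_t)         # query the label oracle
        queries += 1
        if y == 1:
            hi = min(hi, x)           # target threshold must lie below x
        else:
            lo = max(lo, x)           # target threshold must lie at/above x
    # Points outside the disagreement region get the unanimous label of the
    # version space for free; no query is spent on them.

print(f"queried {queries} of {len(stream)} labels; "
      f"version space width {hi - lo:.2e}")
```

For thresholds the expected number of queries grows only logarithmically in the stream length, whereas any passive learner must pay for every label; the paper's disagreement-coefficient generalization characterizes when and by how much such savings extend to arbitrary VC classes.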
