Algorithms for active learning

This dissertation develops and analyzes active learning algorithms for binary classification. In passive (non-active) learning, a learner uses a random sample of labeled examples drawn from a fixed distribution to select a hypothesis with low error. In active learning, the learner receives only a sample of unlabeled data, but may query the label of any of these points. The hope is that the active learner needs to query the labels of just a few carefully chosen points in order to produce a hypothesis with low error.

The first part of this dissertation develops algorithms based on maintaining a version space, the set of hypotheses still in contention to be selected. The version space is specifically designed to tolerate arbitrary label noise and model mismatch in the agnostic learning model. The algorithms maintain the version space through a reduction to a special form of agnostic learning that allows example-based constraints; this represents a computational improvement over previous methods. The generalization behavior of one of these algorithms is rigorously analyzed using a quantity called the disagreement coefficient. This algorithm is shown to have a label complexity that improves over that of previous methods and matches known label complexity lower bounds in certain cases.

The second part of this dissertation develops algorithms based on simpler reductions to agnostic learning that more closely match the standard abstraction of supervised learning procedures. The generalization behavior of these algorithms is also analyzed in the agnostic learning model, and they are shown to have label complexity comparable to that of the version space methods. Because strict version space methods can be risky to deploy in practice, these algorithms represent a qualitative improvement over them. The first of these algorithms is based on a relaxation of a version space method, and the second is based on an importance weighting technique. The second algorithm is also shown to adapt automatically to noise conditions under which a tighter label complexity analysis holds. Experiments with this algorithm are presented to illustrate the promise of the method.
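To make the disagreement-based, version-space approach concrete, the following is a minimal sketch in a toy setting of threshold classifiers on the real line. The constrained ERM oracle `best_threshold`, the fixed `slack` used in place of a generalization bound, and the noiseless demo target are all illustrative assumptions; this is a sketch in the spirit of the version space algorithms summarized above, not the dissertation's exact construction.

```python
# Sketch: disagreement-based active learning for threshold classifiers
# x -> sign(x - t).  A label is queried only when both labels of the
# current point are still plausible under a constrained ERM oracle.
import random
import math

def threshold_error(t, labeled):
    """Empirical error of the threshold classifier x -> sign(x - t)."""
    return sum((1 if x >= t else -1) != y for (x, y) in labeled)

def best_threshold(points, labeled, force=None):
    """Constrained ERM oracle: best threshold on the labeled data,
    optionally forced to predict a given label on a given point."""
    candidates = [-math.inf] + sorted(points) + [math.inf]
    best, best_err = None, math.inf
    for t in candidates:
        if force is not None:
            x, y = force
            if (1 if x >= t else -1) != y:
                continue  # violates the example-based constraint
        err = threshold_error(t, labeled)
        if err < best_err:
            best, best_err = t, err
    return best, best_err

def active_learn(unlabeled, query_label, slack=1):
    """Query a label only when forcing either label on the current point
    costs about the same; otherwise infer the label and move on."""
    labeled, queries = [], 0
    for x in unlabeled:
        _, err_pos = best_threshold(unlabeled, labeled, force=(x, +1))
        _, err_neg = best_threshold(unlabeled, labeled, force=(x, -1))
        if err_pos - err_neg > slack:      # +1 ruled out: infer -1
            labeled.append((x, -1))
        elif err_neg - err_pos > slack:    # -1 ruled out: infer +1
            labeled.append((x, +1))
        else:                              # genuine disagreement: query
            labeled.append((x, query_label(x)))
            queries += 1
    t, _ = best_threshold(unlabeled, labeled)
    return t, queries

if __name__ == "__main__":
    random.seed(0)
    xs = [random.uniform(0, 1) for _ in range(200)]
    oracle = lambda x: 1 if x >= 0.3 else -1   # noiseless target for the demo
    t_hat, n_queries = active_learn(xs, oracle)
    print(f"learned threshold {t_hat:.3f} using {n_queries} label queries")
```

In the real algorithms, the constant `slack` is replaced by a data-dependent generalization bound, which is what makes the version space tolerant of agnostic noise.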
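The importance weighting idea from the second part can be sketched in the same toy setting. The query-probability rule below (higher near the current boundary, floored at `p_min`) is a simplified, hypothetical stand-in for the rejection-threshold rule analyzed in the dissertation; the point of the sketch is only the mechanism of querying a point with probability p and giving queried examples importance weight 1/p.

```python
# Sketch: importance weighted active learning with threshold classifiers.
# Each point is queried with some probability p; queried examples enter
# a weighted ERM with weight 1/p so the weighted error stays unbiased.
import random

def weighted_error(t, sample):
    """Importance-weighted error of the threshold x -> sign(x - t)."""
    return sum(w for (x, y, w) in sample if (1 if x >= t else -1) != y)

def fit_threshold(sample):
    """Weighted ERM over thresholds placed at the queried points."""
    candidates = [x for (x, _, _) in sample] or [0.0]
    return min(candidates, key=lambda t: weighted_error(t, sample))

def iwal(unlabeled, query_label, p_min=0.1, scale=5.0):
    sample = []          # (point, label, importance weight) triples
    t_hat = 0.5          # arbitrary initial threshold
    queries = 0
    for x in unlabeled:
        # Heuristic query probability: larger near the current boundary,
        # never below p_min so the weights 1/p remain bounded.
        p = max(p_min, min(1.0, 1.0 - scale * abs(x - t_hat)))
        if random.random() < p:
            sample.append((x, query_label(x), 1.0 / p))
            queries += 1
            t_hat = fit_threshold(sample)
    return t_hat, queries

if __name__ == "__main__":
    random.seed(0)
    xs = [random.uniform(0, 1) for _ in range(500)]
    oracle = lambda x: 1 if x >= 0.3 else -1
    t_hat, n_queries = iwal(xs, oracle)
    print(f"learned threshold {t_hat:.3f} using {n_queries} of {len(xs)} labels")
```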
