Surrogate Losses in Passive and Active Learning

Active learning is a type of sequential design for supervised machine learning, in which the learning algorithm sequentially requests the labels of selected instances from a large pool of unlabeled data points. The objective is to produce a classifier of relatively low risk, as measured under the 0-1 loss, ideally using fewer label requests than the number of random labeled data points sufficient to achieve the same. This work investigates the potential uses of surrogate loss functions in the context of active learning. Specifically, it presents an active learning algorithm based on an arbitrary classification-calibrated surrogate loss function, along with an analysis of the number of label requests sufficient for the classifier returned by the algorithm to achieve a given risk under the 0-1 loss. Interestingly, these results cannot be obtained by simply optimizing the surrogate risk via active learning to an extent sufficient to provide a guarantee on the 0-1 loss, as is common practice in the analysis of surrogate losses for passive learning. Some of the results have additional implications for the use of surrogate losses in passive learning.
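To make the setting concrete, below is a minimal, self-contained sketch of pool-based active learning with a classification-calibrated surrogate loss: a linear classifier is fit by approximately minimizing the empirical logistic (surrogate) risk over the labels requested so far, and its final quality is measured under the 0-1 loss. This is only an illustration of the general setup, not the algorithm analyzed in this work; the uncertainty-sampling query rule, the synthetic data, and all function and variable names are assumptions made for the example.

```python
# Illustrative sketch only (not the algorithm from this work): pool-based active
# learning in which a linear classifier is trained by minimizing a
# classification-calibrated surrogate loss (logistic loss) on the queried labels,
# and evaluated under the 0-1 loss. Query rule and names are hypothetical.
import numpy as np

def logistic_loss_grad(w, X, y):
    """Gradient of the average logistic (surrogate) loss at w."""
    margins = np.clip(y * (X @ w), -50, 50)  # clip only to avoid overflow warnings
    coeff = -1.0 / (1.0 + np.exp(margins))   # d/dm log(1 + exp(-m))
    return (X.T @ (coeff * y)) / len(y)

def fit_surrogate(X, y, steps=500, lr=0.5):
    """Approximately minimize the empirical surrogate risk by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * logistic_loss_grad(w, X, y)
    return w

def zero_one_risk(w, X, y):
    """Empirical 0-1 risk of the classifier sign(<w, x>)."""
    return np.mean(np.sign(X @ w) != y)

rng = np.random.default_rng(0)
# Synthetic unlabeled pool; labels follow a noisy linear rule (10% label noise).
X_pool = rng.normal(size=(2000, 5))
w_true = rng.normal(size=5)
clean = np.sign(X_pool @ w_true)
y_pool = np.where(rng.random(2000) < 0.9, clean, -clean)

budget = 100
queried = list(rng.choice(len(X_pool), size=10, replace=False))  # small seed set
w = fit_surrogate(X_pool[queried], y_pool[queried])
while len(queried) < budget:
    # Request the label of the unlabeled point the current classifier is least
    # certain about (smallest |<w, x>|); y_pool[i] stands in for the labeling oracle.
    remaining = np.setdiff1d(np.arange(len(X_pool)), queried)
    i = remaining[np.argmin(np.abs(X_pool[remaining] @ w))]
    queried.append(i)
    w = fit_surrogate(X_pool[queried], y_pool[queried])

print(f"labels requested: {len(queried)}, "
      f"empirical 0-1 risk on pool: {zero_one_risk(w, X_pool, y_pool):.3f}")
```

Because the logistic loss is classification-calibrated, driving the surrogate risk down also controls the 0-1 risk; the point of the analysis summarized above is that, in the active setting, this passive-style reduction alone does not yield the desired label complexity guarantees, which is why the algorithm and its analysis must treat the surrogate loss more carefully.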
