(Almost) No Label No Cry

In Learning with Label Proportions (LLP), the objective is to learn a supervised classifier when, instead of labels, only label proportions for bags of observations are known. This setting has broad practical relevance, in particular for privacy-preserving data processing. We first show that the mean operator, a statistic that aggregates all labels, is minimally sufficient for minimizing many proper scoring losses with linear (or kernelized) classifiers, without using labels. We provide a fast learning algorithm that estimates the mean operator via a manifold regularizer, with guaranteed approximation bounds. We then present an iterative learning algorithm that uses this estimate as its initialization. We ground this algorithm in Rademacher-style generalization bounds tailored to the LLP setting, introducing a generalization of Rademacher complexity and a Label Proportion Complexity measure; the algorithm optimizes tractable bounds on the corresponding bag-empirical risk. Experiments on fourteen domains, with sizes ranging up to ≈300K observations, show that our algorithms are scalable and tend to outperform the state of the art in LLP. Moreover, in many cases our algorithms match the Oracle that learns with all labels known, or come within a few percent of its AUC. On the largest domains, half a dozen proportions can suffice, i.e. roughly 40K times fewer than the total number of labels.
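To make the mean-operator idea concrete, here is a minimal sketch in Python/NumPy, under simplifying assumptions: the class-conditional means are taken to be shared across bags and recovered by least squares from bag means and label proportions (in the spirit of the Mean Map estimator of Quadrianto et al., without the paper's manifold regularizer), and the resulting mean-operator estimate is plugged into a label-free rewriting of the logistic loss. The names `estimate_mean_operator` and `label_free_logistic_risk` are illustrative, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def estimate_mean_operator(bags, proportions):
    """Estimate the mean operator mu = E[y * x] from bag means and
    bag-level positive rates. Assumes the class-conditional means are
    shared across bags (Mean Map assumption); the paper's algorithms
    relax this via a manifold (Laplacian) regularizer."""
    bag_means = np.array([b.mean(axis=0) for b in bags])    # (n_bags, d)
    sizes = np.array([len(b) for b in bags], dtype=float)
    pi = np.asarray(proportions, dtype=float)               # P(y=+1 | bag)
    # Each bag mean is a pi-mixture of the two class means:
    #   bag_mean_j ~= pi_j * mu_plus + (1 - pi_j) * mu_minus
    Pi = np.column_stack([pi, 1.0 - pi])                    # (n_bags, 2)
    M, *_ = np.linalg.lstsq(Pi, bag_means, rcond=None)      # (2, d)
    mu_plus, mu_minus = M
    p_plus = np.average(pi, weights=sizes)                  # overall P(y=+1)
    return p_plus * mu_plus - (1.0 - p_plus) * mu_minus     # E[y * x]

def label_free_logistic_risk(theta, X, mu):
    """Empirical logistic risk written without labels: for y in {-1,+1},
    log(1 + exp(-y*z)) = (log(1+exp(-z)) + log(1+exp(z)))/2 - y*z/2,
    so the risk depends on labels only through mu = mean(y_i * x_i)."""
    z = X @ theta
    sym = 0.5 * (np.logaddexp(0.0, -z) + np.logaddexp(0.0, z)).mean()
    return sym - 0.5 * (theta @ mu)

# Usage on synthetic data: draw labeled points, group them into bags,
# then keep only the bag-level proportions (the LLP supervision).
rng = np.random.default_rng(0)
d, m = 5, 1200
w_true = rng.normal(size=d)
X = rng.normal(size=(m, d))
y = np.where(X @ w_true + 0.5 * rng.normal(size=m) > 0, 1, -1)
order = np.argsort(X[:, 0])          # sort so bag proportions differ
X, y = X[order], y[order]
bags, labels = np.array_split(X, 6), np.array_split(y, 6)
proportions = [(lb == 1).mean() for lb in labels]

mu = estimate_mean_operator(bags, proportions)
theta = minimize(label_free_logistic_risk, np.zeros(d),
                 args=(np.vstack(bags), mu)).x
```

A practical caveat of this bare-bones estimator: when all bags have similar proportions, the least-squares system is ill-conditioned, which is one motivation for the regularized estimation analyzed in the paper.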
