Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications

The choice of loss function is critical in extreme multi-label learning, where the objective is to annotate each data point with the most relevant subset of labels from an extremely large label set. Unfortunately, existing loss functions, such as the Hamming loss, are unsuitable for learning, model selection, hyperparameter tuning and performance evaluation. This paper addresses the issue by developing propensity-scored losses which: (a) prioritize predicting the few relevant labels over the large number of irrelevant ones; (b) do not erroneously treat missing labels as irrelevant, but instead provide unbiased estimates of the true loss function even when ground-truth labels go missing under arbitrary probabilistic label noise models; and (c) promote the accurate prediction of infrequently occurring, hard-to-predict, but rewarding tail labels. Another contribution is the development of algorithms which efficiently scale to extremely large datasets with up to 9 million labels, 70 million points and 2 million dimensions, and which give significant improvements over the state-of-the-art. This paper's results also apply to tagging, recommendation and ranking, which are the motivating applications for extreme multi-label learning. They generalize previous attempts at deriving unbiased losses, which made the restrictive assumption that labels go missing uniformly at random from the ground truth. Furthermore, they provide a sound theoretical justification for popular label-weighting heuristics used to recommend rare items. Finally, they demonstrate that the proposed contributions align with real-world applications by achieving superior click-through rates on sponsored search advertising in Bing.
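The abstract does not reproduce the loss definitions themselves, so the following is a minimal sketch of the two central quantities: the sigmoidal propensity model fit to label frequencies and the propensity-scored precision@k built on it, as described in the paper. The function names `propensity` and `psp_at_k`, the toy data, and the constants A = 0.55 and B = 1.5 (the defaults the paper reports for most benchmarks) are illustrative choices here, not the authors' code.

```python
import numpy as np

def propensity(label_counts, num_points, A=0.55, B=1.5):
    # Sigmoidal propensity model:
    #   p_l = 1 / (1 + C * exp(-A * log(N_l + B))),  C = (log N - 1) * (B + 1)^A
    # where N_l is the number of training points tagged with label l and
    # N is the total number of training points. A and B are dataset-specific.
    C = (np.log(num_points) - 1.0) * (B + 1.0) ** A
    return 1.0 / (1.0 + C * np.exp(-A * np.log(label_counts + B)))

def psp_at_k(scores, y_observed, p, k=5):
    # Propensity-scored precision@k for one test point:
    #   (1/k) * sum over the k highest-scoring labels of y_l / p_l.
    # Dividing each observed relevant label by its propensity corrects for
    # labels missing from the ground truth and up-weights rare tail labels,
    # yielding an unbiased estimate of precision@k on the complete labels.
    topk = np.argsort(-scores)[:k]
    return y_observed[topk].dot(1.0 / p[topk]) / k

# Toy usage: 5 labels, the last one a rare tail label.
counts = np.array([900.0, 500.0, 300.0, 50.0, 3.0])  # N_l per label
p = propensity(counts, num_points=1000)
scores = np.array([0.9, 0.1, 0.4, 0.2, 0.8])  # classifier scores for one point
y = np.array([1.0, 0.0, 0.0, 0.0, 1.0])       # observed (possibly incomplete) labels
print(psp_at_k(scores, y, p, k=2))            # the tail label at index 4 counts for more
```

In reported results this quantity is additionally normalized by the best achievable size-k prediction so that scores lie in [0, 1]; the unnormalized form above is enough to show how the 1/p_l weighting rewards correctly predicted tail labels.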
