New Insights into Metric Optimization for Ranking-based Recommendation

Direct optimization of IR metrics is a widely adopted approach to devising ranking-based recommender systems. Most methods following this approach (e.g. TFMAP, CLiMF, Top-N-Rank) optimize the same metric that is used for evaluation, under the assumption that this yields the best performance. A number of studies, however, call this assumption into question. In this paper, we dig deeper into this issue to better understand how the choice of the metric to optimize affects the performance of a ranking-based recommender system. We present an extensive experimental study, conducted on different datasets in both pairwise and listwise learning-to-rank (LTR) scenarios, that compares the relative merit of four popular IR metrics, namely RR, AP, nDCG and RBP, when used in various combinations for the optimization and assessment of recommender systems. For the first three, we follow loss function formulations available in the literature; for RBP, we propose novel loss functions for both the pairwise and listwise scenarios. Our results confirm that the best performance is indeed not necessarily achieved when optimizing the same metric used for evaluation. In fact, we find that RBP-inspired losses consistently perform at least as well as the other metrics and offer clear benefits in several cases. Interestingly, while RBP-inspired losses improve recommendation performance for all users, the individual performance gain correlates with a user's activity level in interacting with items: the more active the user, the larger the benefit. Overall, our results challenge the assumption behind the current research practice of optimizing and evaluating the same metric, and point to RBP-based optimization instead as a promising alternative when learning to rank in the recommendation context.
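
For context, rank-biased precision (RBP), introduced in [7], models a user who moves from one rank to the next with persistence probability p; for binary relevance labels r_i over a ranking of depth d it is defined as

    \mathrm{RBP} = (1 - p) \sum_{i=1}^{d} r_i \, p^{\,i-1},

where p \in (0, 1) controls how deep in the ranking the modelled user is expected to look: the closer p is to 1, the more weight is given to items further down the list.

To illustrate what optimizing such a top-heavy metric can look like, the sketch below gives a minimal smoothed listwise RBP surrogate in Python (PyTorch). It is a hypothetical example only, not the loss proposed in the paper: the sigmoid-based rank approximation, the function name smoothed_rbp_loss, and its parameters are illustrative assumptions.

    import torch

    def smoothed_rbp_loss(scores: torch.Tensor,
                          labels: torch.Tensor,
                          p: float = 0.8,
                          temperature: float = 1.0) -> torch.Tensor:
        """Illustrative smoothed RBP surrogate for one user (not the paper's formulation).

        scores: (n_items,) predicted scores; labels: (n_items,) binary relevance (1 = relevant).
        """
        # Differentiable rank approximation:
        # rank_i ~ 1 + sum_{j != i} sigmoid((s_j - s_i) / temperature)
        diff = (scores.unsqueeze(0) - scores.unsqueeze(1)) / temperature  # diff[i, j] = s_j - s_i
        approx_rank = 1.0 + torch.sigmoid(diff).sum(dim=1) - 0.5          # drop the j == i term

        # Smoothed RBP: (1 - p) * sum_i rel_i * p^(rank_i - 1); negate so that
        # gradient descent pushes relevant items toward the top of the ranking.
        rbp = (1.0 - p) * (labels * p ** (approx_rank - 1.0)).sum()
        return -rbp

    # Hypothetical usage: scores for three items, two of them relevant.
    # scores = torch.tensor([2.0, 0.5, 1.0], requires_grad=True)
    # labels = torch.tensor([1.0, 0.0, 1.0])
    # loss = smoothed_rbp_loss(scores, labels)   # then backpropagate with loss.backward()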

[1] Claude E. Shannon et al. A mathematical theory of communication, 1948, Bell System Technical Journal.

[2] Alan Hanjalic et al. List-wise learning to rank with matrix factorization for collaborative filtering, 2010, RecSys '10.

[3] Enrique Amigó et al. An Axiomatic Analysis of Diversity Evaluation Metrics: Introducing the Rank-Biased Utility Metric, 2018, SIGIR.

[4] Jianmo Ni et al. Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects, 2019, EMNLP.

[5] Azadeh Shakery et al. ERR.Rank: An algorithm based on learning to rank for direct optimization of Expected Reciprocal Rank, 2018, Applied Intelligence.

[6] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[7] Alistair Moffat et al. Rank-biased precision for measurement of retrieval effectiveness, 2008, TOIS.

[8] Mark Sanderson et al. Features of Disagreement Between Retrieval Effectiveness Measures, 2015, SIGIR.

[9] Huan Liu et al. mTrust: discerning multi-faceted trust in a connected world, 2012, WSDM '12.

[10] Meng Wang et al. Revisiting Graph based Collaborative Filtering: A Linear Residual Graph Convolutional Network Approach, 2020, AAAI.

[11] Lars Schmidt-Thieme et al. BPR: Bayesian Personalized Ranking from Implicit Feedback, 2009, UAI.

[12] Alejandro Bellogín et al. Assessing ranking metrics in top-N recommendation, 2020, Information Retrieval Journal.

[13] Julio Gonzalo et al. A general evaluation measure for document organization tasks, 2013, SIGIR.

[14] Michael D. Ekstrand. LensKit for Python: Next-Generation Software for Recommender Systems Experiments, 2020, CIKM.

[15] Rong Jin et al. Learning to Rank by Optimizing NDCG Measure, 2009, NIPS.

[16] Alistair Moffat et al. Seven Numeric Properties of Effectiveness Metrics, 2013, AIRS.

[17] Martha Larson et al. TFMAP: optimizing MAP for top-n context-aware recommendation, 2012, SIGIR '12.

[18] Emine Yilmaz et al. The maximum entropy method for analyzing retrieval measures, 2005, SIGIR '05.

[19] Gregory N. Hullender et al. Learning to rank using gradient descent, 2005, ICML.

[20] Shou-De Lin et al. LambdaMF: Learning Nonsmooth Ranking Functions in Matrix Factorization Using Lambda, 2015, IEEE International Conference on Data Mining (ICDM).

[21] Pan Zhou et al. Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning, 2020, NeurIPS.

[22] Alexander J. Smola et al. Maximum Margin Matrix Factorization for Collaborative Ranking, 2007.

[23] Vasant Honavar et al. Top-N-Rank: A Scalable List-wise Ranking Method for Recommender Systems, 2018, IEEE International Conference on Big Data.

[24] Martha Larson et al. CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering, 2012, RecSys.

[25] Yongdong Zhang et al. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation, 2020, SIGIR.

[26] Alistair Moffat et al. Models and metrics: IR evaluation as a user process, 2012, ADCS.

[27] S. R. Searle et al. Population Marginal Means in the Linear Model: An Alternative to Least Squares Means, 1980.

[28] Peter Bailey et al. Incorporating User Expectations and Behavior into the Measurement of Search Effectiveness, 2017, ACM Trans. Inf. Syst.

[29] Jun Wang et al. MarlRank: Multi-agent Reinforced Learning to Rank, 2019, CIKM.

[30] James She et al. Collaborative Variational Autoencoder for Recommender Systems, 2017, KDD.

[31] Noriko Kando et al. On information retrieval metrics designed for evaluation with incomplete relevance assessments, 2008, Information Retrieval.

[32] Geoffrey E. Hinton et al. Rectified Linear Units Improve Restricted Boltzmann Machines, 2010, ICML.

[33] Matthew Lease et al. Correlation, Prediction and Ranking of Evaluation Metrics in Information Retrieval, 2019, ECIR.

[34] Alistair Moffat et al. Offline evaluation options for recommender systems, 2020, Information Retrieval Journal.

[35] Alistair Moffat et al. Precision-at-ten considered redundant, 2008, SIGIR '08.

[36] Tat-Seng Chua et al. Neural Collaborative Filtering, 2017, WWW.

[37] Christopher J. C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview, 2010.

[38] Werner Dubitzky et al. Fundamentals of Data Mining in Genomics and Proteomics, 2009.

[39] Wu-Jun Li et al. Collaborative Topic Regression with Social Regularization for Tag Recommendation, 2013, IJCAI.

[40] Luca Antiga et al. Automatic differentiation in PyTorch, 2017.

[41] P. Castells et al. On Target Item Sampling in Offline Recommender System Evaluation, 2020, RecSys.

[42] Jie Yang et al. Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison, 2020, RecSys.

[43] Weinan Zhang et al. LambdaFM: Learning Optimal Ranking with Factorization Machines Using Lambda Surrogates, 2016, CIKM.

[44] Stephen E. Robertson et al. On the choice of effectiveness measures for learning to rank, 2010, Information Retrieval.

[45] Josiane Mothe et al. How many performance measures to evaluate information retrieval systems?, 2011, Knowledge and Information Systems.

[46] Tie-Yan Liu et al. Learning to rank for information retrieval, 2009, SIGIR.

[47] Pinar Donmez et al. On the local optimality of LambdaRank, 2009, SIGIR.

[48] Yang Wang et al. Unbiased LambdaMART: An Unbiased Pairwise Learning-to-Rank Algorithm, 2018, WWW.

[49] E. Jaynes. On the rationale of maximum-entropy methods, 1982, Proceedings of the IEEE.

[50] Charles L. A. Clarke et al. On the informativeness of cascade and intent-aware effectiveness measures, 2011, WWW.