Quality Metrics in Recommender Systems: Do We Calculate Metrics Consistently?

Offline evaluation is a popular approach to determining the best algorithm in terms of a chosen quality metric. However, if the chosen metric computes something other than what its users expect, the mismatch can lead to poor decisions and wrong conclusions. In this paper, we thoroughly investigate the quality metrics used for recommender system evaluation. We look at the practical side, the implementations found in modern RecSys libraries, and at the theoretical side, the definitions given in academic papers. We find that Precision is the only metric defined consistently across papers and libraries, while other metrics may have different interpretations. Metrics implemented in different libraries sometimes share a name but measure different things, which leads to different results for the same input. When defining metrics in an academic paper, authors sometimes omit explicit formulations or cite references that do not contain them either. In 47% of cases, we cannot easily determine how a metric is defined because the definition is unclear or absent. These findings highlight yet another difficulty in recommender system evaluation and call for more detailed descriptions of evaluation protocols.
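To illustrate how two metrics with the same name can disagree on the same input, here is a minimal sketch of two conventions for Recall@k that differ only in the denominator. The function names and the toy data are hypothetical and are not taken from any library or paper surveyed here. The capped denominator is one convention seen in practice, and it is shown only as an example of the kind of divergence described above.

```python
def recall_at_k_full(recommended, relevant, k):
    """Hits in the top-k divided by the number of all relevant items."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)


def recall_at_k_capped(recommended, relevant, k):
    """Hits in the top-k divided by min(k, number of relevant items)."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / min(k, len(relevant))


# Toy input (hypothetical): one user's ranked list and held-out relevant items.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "x", "y"}

full = recall_at_k_full(recommended, relevant, k=2)
capped = recall_at_k_capped(recommended, relevant, k=2)
print(full, capped)  # 0.25 0.5
```

Both functions could reasonably be reported as "Recall@2", yet one value is twice the other here, so a reported number is only comparable when the denominator convention is stated.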
