A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems

The evaluation of recommender systems is crucial for their development. In today's recommendation landscape there are many standardized recommendation algorithms and approaches; however, there exists no standardized method for the experimental setup of evaluation, not even for widely used measures such as precision and root-mean-squared error. This creates a setting where comparing recommendation results on the same datasets becomes problematic. In this paper, we propose an evaluation protocol specifically developed with the recommendation use case in mind, i.e., the recommendation of one or several items to an end user. The protocol attempts to closely mimic the scenario of a deployed (production) recommender system, taking specific user aspects into consideration and allowing a comparison of small- and large-scale recommender systems. The protocol is evaluated on common recommendation datasets and compared to traditional recommendation settings found in the research literature. Our results show that the proposed protocol captures the quality of a recommender system better than traditional evaluation does, and is not affected by characteristics of the data (e.g., size, sparsity, etc.).
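To make the kind of top-N evaluation discussed above concrete, the following is a minimal sketch of a per-user holdout split and a precision@N computation. It is an illustrative assumption, not the paper's exact protocol; the function names (`split_per_user`, `recommend`) and the choice of one withheld item per user are hypothetical.

```python
# Hypothetical sketch of top-N evaluation with a per-user holdout.
# Not the protocol proposed in the paper; names and parameters are illustrative.
from collections import defaultdict
import random

def split_per_user(interactions, holdout=1, seed=42):
    """Withhold `holdout` items per user as test data; return (train, test)."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)
    train, test = [], defaultdict(set)
    for user, items in by_user.items():
        rng.shuffle(items)
        test[user].update(items[:holdout])
        train.extend((user, item) for item in items[holdout:])
    return train, test

def precision_at_n(recommend, test, n=10):
    """Mean precision@N over all users with withheld items.

    `recommend(user, n)` is assumed to return a ranked list of n item ids.
    """
    scores = []
    for user, relevant in test.items():
        top_n = recommend(user, n)
        hits = sum(1 for item in top_n if item in relevant)
        scores.append(hits / n)
    return sum(scores) / len(scores) if scores else 0.0
```

In such a setup, precision@N depends on how the holdout is drawn (random, temporal, per-user) and on N, which is one reason results reported under differing experimental setups are hard to compare directly.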
