Evaluating Retail Recommender Systems via Retrospective Data: Lessons Learnt from a Live-Intervention Study

Performance evaluation via retrospective data is essential to the development of recommender systems. However, it is necessary to ensure that the evaluation results are representative of live, interactive behaviour. We present a case study of several common evaluation strategies applied to data from a live intervention. The intervention is designed as a case-control experiment applied to two cohorts of consumers (active and non-active) from an online retailer. This yields four binary hit rate indicators of live performance. We compare these with the results of evaluation strategies applied to the same basket data that was available immediately prior to the recommendations being made, treating that data as historical. In this case, none of the standard evaluation strategies predicted binary hit rates comparable to those observed during the live intervention. We argue that they may not sufficiently represent live, interactive behaviour to usefully guide system development with retrospective data. We present a novel evaluation strategy that consistently provides binary hit rates comparable to the live results. It seems to mirror the actual operation of the recommender more closely, paying particular attention to the principles and constraints that are expected to apply.

Key Words: Recommender Systems, Performance Evaluation, Model Selection & Comparison, Business Applications, Lessons Learnt
