Towards reproducibility in recommender-systems research

Numerous recommendation approaches are in use today. However, comparing their effectiveness is a challenging task because evaluation results are rarely reproducible. In this article, we examine the challenge of reproducibility in recommender-system research. We conduct experiments using Plista's news recommender system and Docear's research-paper recommender system. The experiments show that there are large discrepancies in the effectiveness of identical recommendation approaches in only slightly different scenarios, as well as large discrepancies for slightly different approaches in identical scenarios. For example, in one news-recommendation scenario, the performance of a content-based filtering approach was twice that of the second-best approach, while in another scenario the same content-based filtering approach was the worst-performing approach. We identified several determinants that may contribute to the large discrepancies observed in recommendation effectiveness. The determinants we examined include user characteristics (gender and age), datasets, weighting schemes, the time at which recommendations were shown, and user-model size. Some of the determinants have interdependencies. For instance, the optimal size of an algorithm's user model depended on the users' age. Since minor variations in approaches and scenarios can lead to significant changes in a recommendation approach's performance, ensuring reproducibility of experimental results is difficult. We discuss these findings and conclude that to ensure reproducibility, the recommender-system community needs to (1) survey other research fields and learn from them, (2) find a common understanding of reproducibility, (3) identify and understand the determinants that affect reproducibility, (4) conduct more comprehensive experiments, (5) modernize publication practices, (6) foster the development and use of recommendation frameworks, and (7) establish best-practice guidelines for recommender-systems research.
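To make the abstract's claim concrete, the following is a minimal, self-contained Python sketch, not the paper's actual pipeline: all documents, weights, and parameter values are hypothetical. It shows how two of the determinants named above, the term-weighting scheme and the user-model size, enter a simple content-based filter, and how changing the scheme alone can reverse which candidate item is ranked first.

```python
# Minimal sketch, not the paper's pipeline: all documents and parameter
# values below are hypothetical. It illustrates how the term-weighting
# scheme and the user-model size enter a simple content-based filter.
import math
from collections import Counter

def term_weights(doc, corpus, scheme="tf-idf"):
    """Weight one document's terms by raw term frequency or by TF-IDF."""
    tf = Counter(doc)
    if scheme == "tf":
        return dict(tf)
    n_docs = len(corpus)
    # IDF: a term occurring in every document (e.g. "news") gets weight 0.
    return {t: f * math.log(n_docs / sum(t in d for d in corpus))
            for t, f in tf.items()}

def user_model(user_docs, corpus, scheme, size):
    """Aggregate the user's documents, keeping only the `size` heaviest terms."""
    agg = Counter()
    for doc in user_docs:
        for t, w in term_weights(doc, corpus, scheme).items():
            agg[t] += w
    return dict(agg.most_common(size))

def score(model, candidate):
    """Dot product of the user model with the candidate's term counts."""
    tf = Counter(candidate)
    return sum(w * tf[t] for t, w in model.items())

corpus = [
    "football goal football news".split(),    # read by the user
    "politics election news news".split(),    # read by the user
    "news news news sports".split(),          # candidate A
    "football politics news debate".split(),  # candidate B
]
user_docs = corpus[:2]
candidates = {"A": corpus[2], "B": corpus[3]}

for scheme in ("tf", "tf-idf"):
    for size in (1, 10):  # tiny vs. effectively unbounded user model
        model = user_model(user_docs, corpus, scheme, size)
        ranking = sorted(candidates,
                         key=lambda c: score(model, candidates[c]),
                         reverse=True)
        print(f"scheme={scheme:6s} model size={size:2d} -> ranking {ranking}")
```

In this toy corpus, switching from raw term frequency to TF-IDF reverses the ranking, because the ubiquitous term "news" is down-weighted to zero; the user-model size determines how many of these weighted terms survive into the profile at all. The paper's experiments report analogous sensitivity at scale, including interdependencies such as the optimal user-model size varying with user age.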
