Towards reproducibility in recommender-systems research

Numerous recommendation approaches are in use today. However, comparing their effectiveness is a challenging task because evaluation results are rarely reproducible. In this article, we examine the challenge of reproducibility in recommender-system research. We conduct experiments using Plista's news recommender system and Docear's research-paper recommender system. The experiments show that there are large discrepancies in the effectiveness of identical recommendation approaches in only slightly different scenarios, as well as large discrepancies for slightly different approaches in identical scenarios. For example, in one news-recommendation scenario, the performance of a content-based filtering approach was twice that of the second-best approach, while in another scenario the same content-based filtering approach was the worst-performing approach. We identified several determinants that may contribute to the large discrepancies observed in recommendation effectiveness. The determinants we examined include user characteristics (gender and age), datasets, weighting schemes, the time at which recommendations were shown, and user-model size. Some of the determinants have interdependencies. For instance, the optimal size of an algorithm's user model depended on the users' age. Since minor variations in approaches and scenarios can lead to significant changes in a recommendation approach's performance, ensuring reproducibility of experimental results is difficult. We discuss these findings and conclude that to ensure reproducibility, the recommender-system community needs to (1) survey other research fields and learn from them, (2) find a common understanding of reproducibility, (3) identify and understand the determinants that affect reproducibility, (4) conduct more comprehensive experiments, (5) modernize publication practices, (6) foster the development and use of recommendation frameworks, and (7) establish best-practice guidelines for recommender-systems research.
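To make the abstract's claim concrete, the following is a minimal, self-contained Python sketch, not the paper's actual pipeline: all documents, weights, and parameter values are hypothetical. It shows how two of the determinants named above, the term-weighting scheme and the user-model size, enter a simple content-based filter, and how changing the scheme alone can reverse which candidate item is ranked first.

```python
# Minimal sketch, not the paper's pipeline: all documents and parameter
# values below are hypothetical. It illustrates how the term-weighting
# scheme and the user-model size enter a simple content-based filter.
import math
from collections import Counter

def term_weights(doc, corpus, scheme="tf-idf"):
    """Weight one document's terms by raw term frequency or by TF-IDF."""
    tf = Counter(doc)
    if scheme == "tf":
        return dict(tf)
    n_docs = len(corpus)
    # IDF: a term occurring in every document (e.g. "news") gets weight 0.
    return {t: f * math.log(n_docs / sum(t in d for d in corpus))
            for t, f in tf.items()}

def user_model(user_docs, corpus, scheme, size):
    """Aggregate the user's documents, keeping only the `size` heaviest terms."""
    agg = Counter()
    for doc in user_docs:
        for t, w in term_weights(doc, corpus, scheme).items():
            agg[t] += w
    return dict(agg.most_common(size))

def score(model, candidate):
    """Dot product of the user model with the candidate's term counts."""
    tf = Counter(candidate)
    return sum(w * tf[t] for t, w in model.items())

corpus = [
    "football goal football news".split(),    # read by the user
    "politics election news news".split(),    # read by the user
    "news news news sports".split(),          # candidate A
    "football politics news debate".split(),  # candidate B
]
user_docs = corpus[:2]
candidates = {"A": corpus[2], "B": corpus[3]}

for scheme in ("tf", "tf-idf"):
    for size in (1, 10):  # tiny vs. effectively unbounded user model
        model = user_model(user_docs, corpus, scheme, size)
        ranking = sorted(candidates,
                         key=lambda c: score(model, candidates[c]),
                         reverse=True)
        print(f"scheme={scheme:6s} model size={size:2d} -> ranking {ranking}")
```

In this toy corpus, switching from raw term frequency to TF-IDF reverses the ranking, because the ubiquitous term "news" is down-weighted to zero; the user-model size determines how many of these weighted terms survive into the profile at all. The paper's experiments report analogous sensitivity at scale, including interdependencies such as the optimal user-model size varying with user age.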
