Statistically Robust Evaluation of Stream-Based Recommender Systems

Online incremental models for recommendation are nowadays pervasive in both industry and academia. However, there is no standard evaluation methodology for the algorithms that maintain such models. Moreover, the online evaluation methodologies available in the literature generally fall short in the statistical validation of results, since such validation is not trivially applicable to stream-based algorithms. We propose a k-fold validation framework for the pairwise comparison of recommendation algorithms that learn from user feedback streams, using prequential evaluation. Our proposal enables continuous statistical testing on adaptive-size sliding windows over the outcome of the prequential process, allowing practitioners and researchers to make decisions in real time based on solid statistical evidence. We present a set of experiments to gain insight into the sensitivity and robustness of two statistical tests, McNemar's test and the Wilcoxon signed-rank test, in a streaming data environment. Our results show that besides allowing real-time, fine-grained online assessment, the online versions of the statistical tests are at least as robust as their batch counterparts, and considerably more robust than a simple prequential single-fold approach.
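The continuous testing idea in the abstract can be sketched in a few lines: run two recommenders prequentially, record for each incoming event whether each algorithm's recommendation was a hit, and apply a paired test over a sliding window of these outcomes. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`mcnemar_statistic`, `online_mcnemar`) are invented for this example, the window here is fixed-size (the paper uses adaptive-size windows), and only McNemar's test is shown, with the usual continuity correction, against the chi-square(1) critical value at alpha = 0.05.

```python
from collections import deque
import random


def mcnemar_statistic(window):
    """McNemar's chi-square statistic with continuity correction over
    paired binary outcomes (hit_a, hit_b) in the current window.
    Only the discordant pairs matter: b = A hit while B missed,
    c = A missed while B hit."""
    b = sum(1 for hit_a, hit_b in window if hit_a and not hit_b)
    c = sum(1 for hit_a, hit_b in window if not hit_a and hit_b)
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)


def online_mcnemar(stream, window_size=1000, crit=3.8415):
    """Prequential pairwise comparison of two recommenders.

    `stream` yields (hit_a, hit_b) pairs: for each event, whether each
    algorithm's recommendation list contained the hidden item before the
    models are updated with it. After every event the test is recomputed
    on the sliding window; yields (t, statistic, significant)."""
    window = deque(maxlen=window_size)
    for t, (hit_a, hit_b) in enumerate(stream):
        window.append((hit_a, hit_b))
        stat = mcnemar_statistic(window)
        yield t, stat, stat > crit
```

As a usage example, feeding a simulated stream where algorithm A has a higher hit rate than B (say 0.3 versus 0.2) lets the window-based test flag a significant difference while the stream is still running, which is exactly the real-time decision-making the abstract argues for.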
