Things Change: Comparing Results Using Historical Data and User Testing for Evaluating a Recommendation Task

We address the task of recommending the next likely flight destination to customers of a major international airline, and we compare performance measured on historical flight data with performance measured in an actual user evaluation. On two years of historical data comprising tens of millions of flights, with a test set of 100,000 customers, an ensemble approach and a collaborative filtering approach reached accuracies of 47% and 20%, respectively, which highlights the difficulty of the domain. We then evaluated our recommendations on 10,000 actual customers, split 45-45-10 among the ensemble, collaborative filtering, and a control group. With real users, the overall predictive power was 23%: 19% for the ensemble method and 30% for collaborative filtering, so the relative ranking of the two methods reversed. These results indicate that, in complex and shifting domains such as this one, historical data alone is not a reliable basis for evaluating the impact of user recommendations. We discuss implications for recommendation systems and for future research in this and related domains.
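The user-evaluation design described above can be illustrated with a minimal sketch. It assumes that customers are assigned to groups at random and that "predictive power" means the share of customers whose recommended destination matches their actual next destination; the abstract does not state either detail. The function names `assign_groups` and `hit_rate` are hypothetical, introduced only for illustration.

```python
import random


def assign_groups(customer_ids, seed=0):
    """Randomly split customers 45/45/10 into ensemble, cf, and control.

    Assumption: random assignment; the abstract only gives the split sizes.
    """
    ids = list(customer_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    cut_ensemble = n * 45 // 100   # first 45% of the shuffled list
    cut_cf = n * 90 // 100         # next 45%; the remaining 10% is control
    groups = {}
    for i, cid in enumerate(ids):
        if i < cut_ensemble:
            groups[cid] = "ensemble"
        elif i < cut_cf:
            groups[cid] = "cf"
        else:
            groups[cid] = "control"
    return groups


def hit_rate(recommended, actual):
    """Fraction of customers whose recommended destination equals the actual one.

    Only customers present in both mappings are scored.
    Assumption: a single (top-1) recommendation per customer.
    """
    scored = [cid for cid in recommended if cid in actual]
    if not scored:
        return 0.0
    hits = sum(1 for cid in scored if recommended[cid] == actual[cid])
    return hits / len(scored)
```

Computing `hit_rate` separately for each group, once against held-out historical flights and once against flights taken after the recommendation was shown, yields the kind of offline versus online comparison the abstract reports.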
