Continuous Evaluation of Large-Scale Information Access Systems: A Case for Living Labs

A/B testing is increasingly being adopted for evaluating commercial information access systems with a large user base, since it allows the efficiency and effectiveness of these systems to be observed under real conditions. Unfortunately, unless university-based researchers collaborate closely with industry or develop their own infrastructure and user base, they cannot validate their ideas in live settings with real users. Without online testing opportunities open to the research community, academic researchers cannot employ online evaluation at scale, which means they receive no feedback on their ideas and cannot advance their research further. Businesses, in turn, miss the opportunity to improve customer satisfaction through better systems, and users miss the chance to benefit from improved information access. In this chapter, we introduce two evaluation initiatives at CLEF, NewsREEL and Living Labs for IR (LL4IR), that aim to address this growing “evaluation gap” between academia and industry. We explain the challenges involved and discuss our experiences organizing these living labs.
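To make the A/B testing setup mentioned above concrete, the sketch below shows a minimal online comparison of two rankers: users are deterministically bucketed into a control and a treatment arm, clicks are logged per arm, and the resulting click-through rates are compared with a two-proportion z-test. This is an illustrative simulation only; the bucketing function, the simulated click probabilities, and all identifiers are assumptions for the example and are not taken from NewsREEL or LL4IR.

```python
import hashlib
import random
from math import sqrt


def assign_bucket(user_id: str) -> str:
    """Deterministically assign a user to arm 'A' (control) or 'B' (treatment)."""
    h = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16)
    return "A" if h % 2 == 0 else "B"


def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate for one arm."""
    return clicks / impressions if impressions else 0.0


def two_proportion_z(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """z-statistic for the difference between two click-through rates."""
    p_a, p_b = ctr(clicks_a, n_a), ctr(clicks_b, n_b)
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se if se else 0.0


if __name__ == "__main__":
    random.seed(42)
    counts = {"A": [0, 0], "B": [0, 0]}  # per arm: [clicks, impressions]
    # Hypothetical true click probabilities: the treatment ranker is assumed
    # to be slightly better than the control (illustrative values only).
    true_ctr = {"A": 0.10, "B": 0.12}
    for i in range(20000):
        arm = assign_bucket(f"user-{i}")
        counts[arm][1] += 1
        if random.random() < true_ctr[arm]:
            counts[arm][0] += 1
    z = two_proportion_z(counts["A"][0], counts["A"][1],
                         counts["B"][0], counts["B"][1])
    print(f"CTR A = {ctr(*counts['A']):.4f}, "
          f"CTR B = {ctr(*counts['B']):.4f}, z = {z:.2f}")
```

In a real deployment the click log would come from live user traffic rather than a simulation, and a production system would typically also control for novelty effects, traffic skew, and multiple comparisons before declaring a winner.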
