Online In-Situ Interleaved Evaluation of Real-Time Push Notification Systems

Real-time push notification systems monitor continuous document streams such as social media posts and alert users to relevant content directly on their mobile devices. We describe a user study of such systems in the context of the TREC 2016 Real-Time Summarization Track, where system updates are immediately delivered as push notifications to the mobile devices of a cohort of users. Our study represents, to our knowledge, the first deployment of an interleaved evaluation framework for prospective information needs, and also provides an opportunity to examine user behavior in a realistic setting. Results of our online in-situ evaluation are correlated against the results of a more traditional post-hoc batch evaluation. We observe substantial correlations between many online and batch evaluation metrics, especially for those that share the same basic design (e.g., are utility-based). For some metrics, we observe little correlation, but we are able to identify the volume of messages that a system pushes as one major source of the differences.
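The abstract does not spell out the interleaving procedure, so the following is a minimal sketch under stated assumptions: a team-draft-style merge of two systems' candidate updates into a single push stream, with user interactions credited back to the contributing system. The function names (`team_draft_interleave`, `infer_preference`) and the use of tweet ids as items are illustrative assumptions, not necessarily the mechanism the track deployed.

```python
import random
from typing import Dict, List, Tuple


def team_draft_interleave(run_a: List[str], run_b: List[str],
                          rng: random.Random) -> Tuple[List[str], Dict[str, str]]:
    """Merge two systems' candidate updates (e.g., tweet ids, highest priority first)
    into one delivery stream, remembering which system contributed each item."""
    merged: List[str] = []
    credit: Dict[str, str] = {}
    seen = set()
    n_a = n_b = 0
    while True:
        rem_a = [t for t in run_a if t not in seen]
        rem_b = [t for t in run_b if t not in seen]
        if not rem_a and not rem_b:
            break
        # The team that has contributed fewer items picks next; ties broken randomly.
        pick_a = bool(rem_a) and (not rem_b or n_a < n_b or
                                  (n_a == n_b and rng.random() < 0.5))
        if pick_a:
            item, n_a = rem_a[0], n_a + 1
            credit[item] = "A"
        else:
            item, n_b = rem_b[0], n_b + 1
            credit[item] = "B"
        merged.append(item)
        seen.add(item)
    return merged, credit


def infer_preference(interactions: List[str], credit: Dict[str, str]) -> str:
    """Credit each item the user interacted with to the system that contributed it,
    and report the per-user winner of the pairwise comparison."""
    wins = {"A": 0, "B": 0}
    for item in interactions:
        if item in credit:
            wins[credit[item]] += 1
    if wins["A"] == wins["B"]:
        return "tie"
    return "A" if wins["A"] > wins["B"] else "B"


# Usage: interleave one batch of updates for a topic, then score a user's clicks.
rng = random.Random(42)
stream, credit = team_draft_interleave(["t1", "t2", "t3"], ["t2", "t4"], rng)
print(stream)                            # delivery order; depends on the coin flips
print(infer_preference(["t4"], credit))  # 'B' (t4 was contributed only by system B)
```

Pairwise preferences of this kind, aggregated over users and topics, yield a ranking of systems that can then be rank-correlated (e.g., via Kendall's tau) against rankings induced by post-hoc batch metrics, which is the style of comparison the abstract describes.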
