Investigating per Topic Upper Bound for Session Search Evaluation

Session search is a complex Information Retrieval (IR) task, and its evaluation is correspondingly complex. A large number of factors need to be considered when evaluating session search, including document relevance, document novelty, aspect-related novelty discounting, and the user's effort in examining the documents. Due to this increased complexity, most existing session search evaluation metrics are NP-hard to optimize. Consequently, the optimal value, i.e., the upper bound, of a metric varies greatly across search topics. In Cranfield-like settings such as the Text REtrieval Conference (TREC), system scores are usually averaged over all search topics. With undetermined upper bound values, however, it can be unfair to compare IR systems across different topics. This paper addresses the problem by investigating the actual per-topic upper bounds of existing session search metrics. By decomposing the metrics, we derive the upper bounds via mathematical optimization. We show that, after being normalized by these bounds, the NP-hard session search metrics provide robust comparisons across search topics. The new normalized metrics are evaluated on the official runs submitted to the TREC 2016 Dynamic Domain (DD) Track.
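
As a rough illustration of the per-topic normalization described above, the sketch below (not the paper's code; topic IDs, scores, and upper bounds are hypothetical placeholders) divides each topic's raw metric score by that topic's upper bound before averaging across topics, so that topics with inherently low optimal scores do not dominate the comparison.

```python
# Minimal sketch, assuming per-topic raw metric scores and per-topic upper
# bounds have already been computed elsewhere (e.g., via the optimization
# described in the paper). All values below are hypothetical.

def normalized_average(raw_scores, upper_bounds):
    """Average of per-topic scores, each divided by that topic's upper bound."""
    normalized = []
    for topic, score in raw_scores.items():
        ub = upper_bounds[topic]
        # Guard against degenerate topics whose optimal score is zero.
        normalized.append(score / ub if ub > 0 else 0.0)
    return sum(normalized) / len(normalized)

# Hypothetical example: one run's raw metric scores on three TREC DD topics.
raw = {"DD16-1": 0.42, "DD16-2": 0.10, "DD16-3": 0.27}
ubs = {"DD16-1": 0.80, "DD16-2": 0.15, "DD16-3": 0.60}
print(normalized_average(raw, ubs))  # per-topic normalized, then averaged
```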
