A framework for evaluation and optimization of relevance and novelty-based retrieval

There has been growing interest in building and optimizing retrieval systems with respect to the relevance and novelty of information, which together more realistically reflect the usefulness of a system as perceived by the user. How to combine these criteria into a single metric that can be used to measure as well as optimize retrieval systems is an open challenge that has so far received only partial solutions. Unlike relevance, which can be measured independently for each document, the novelty of a document depends on the other documents the user has seen during his or her past interaction with the system. This is especially problematic for assessing retrieval performance across multiple ranked lists, as well as for learning from the user's feedback, which must be interpreted with respect to the other documents the user has seen. Moreover, users often have different tolerances for redundancy depending on the nature of their information needs and the time available, but this factor is not explicitly modeled by existing approaches to novelty-based retrieval.

In this thesis, we develop a new framework for evaluating as well as optimizing retrieval systems with respect to their utility, measured in terms of the relevance and novelty of information. We combine a nugget-based model of utility with a probabilistic model of user behavior; this leads to a flexible metric that generalizes existing evaluation measures. We demonstrate that our framework naturally extends to the evaluation of session-based retrieval while maintaining a consistent definition of novelty across multiple ranked lists.

Next, we address the complementary problem of optimization, i.e., how to maximize retrieval performance for one or more ranked lists with respect to the proposed measure. Since the system does not know which nuggets are relevant to each query, we propose a ranking approach that uses observable query and document features (e.g., words and named entities) as surrogates for the unknown nuggets; the weights of these features are learned automatically from user feedback. However, finding the ranked list that maximizes the coverage of a given set of nuggets is an NP-hard problem. We exploit the submodularity of the proposed measure to derive lower bounds on the performance of approximate algorithms, and we also conduct experiments to assess the empirical performance of a greedy algorithm under various conditions.

Our framework provides a strong foundation for modeling retrieval performance in terms of the non-independent utility of documents across multiple ranked lists. Moreover, it allows accurate evaluation and optimization of retrieval systems under realistic conditions, and hence enables rapid development and tuning of new algorithms for novelty-based retrieval without the need for user-centric evaluations involving human subjects, which, although more realistic, are expensive, time-consuming, and risky in a live environment.
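To make the flavor of such a metric concrete, here is one plausible instantiation of a nugget-based expected-utility measure combined with a rank-biased user model; the symbols and exact functional form are illustrative assumptions, not the thesis's definitive formula:

    U(d_1, \dots, d_K) \;=\; \sum_{k=1}^{K} p^{\,k-1} \sum_{g \in \mathcal{G}} w_g \, \gamma^{\, n_g(k)} \, \mathbf{1}\{\, g \in N(d_k) \,\}

Here N(d_k) is the set of nuggets contained in document d_k, w_g is the weight (importance) of nugget g, p is the probability that the user continues from one rank to the next, and n_g(k) counts how many times g has already appeared among d_1, ..., d_{k-1}. The parameter \gamma \in [0, 1] encodes the user's tolerance for redundancy: \gamma = 1 reduces to a purely relevance-based graded metric, while \gamma = 0 credits each nugget only the first time it is shown.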

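The optimization side can be illustrated with a small sketch of the greedy strategy the abstract alludes to: repeatedly append the document with the largest marginal gain in expected utility. This is a minimal sketch under the illustrative utility above; the names expected_utility, greedy_ranking, persistence, and redundancy_tolerance are hypothetical and do not come from the thesis.

    import math
    from typing import Dict, List, Set

    def expected_utility(ranking: List[str],
                         doc_nuggets: Dict[str, Set[str]],
                         nugget_weights: Dict[str, float],
                         persistence: float = 0.8,
                         redundancy_tolerance: float = 0.5) -> float:
        # Expected utility of a ranked list: the probability that the user reaches
        # rank k (geometric in `persistence`) times the gain of document d_k, where
        # a nugget's gain decays by `redundancy_tolerance` each time it reappears.
        seen: Dict[str, int] = {}
        utility = 0.0
        for rank, doc in enumerate(ranking):
            reach_prob = persistence ** rank
            nuggets = doc_nuggets.get(doc, set())
            gain = sum(nugget_weights.get(g, 0.0) * redundancy_tolerance ** seen.get(g, 0)
                       for g in nuggets)
            utility += reach_prob * gain
            for g in nuggets:
                seen[g] = seen.get(g, 0) + 1
        return utility

    def greedy_ranking(candidates: List[str],
                       doc_nuggets: Dict[str, Set[str]],
                       nugget_weights: Dict[str, float],
                       k: int = 10) -> List[str]:
        # Greedily append the candidate with the largest marginal utility gain.
        ranking: List[str] = []
        remaining = list(candidates)
        while remaining and len(ranking) < k:
            base = expected_utility(ranking, doc_nuggets, nugget_weights)
            best_doc, best_gain = None, -math.inf
            for doc in remaining:
                gain = expected_utility(ranking + [doc], doc_nuggets, nugget_weights) - base
                if gain > best_gain:
                    best_doc, best_gain = doc, gain
            ranking.append(best_doc)
            remaining.remove(best_doc)
        return ranking

    # Tiny usage example with made-up documents and nugget weights.
    docs = {"d1": {"A", "B"}, "d2": {"A"}, "d3": {"C"}}
    weights = {"A": 1.0, "B": 0.5, "C": 0.8}
    print(greedy_ranking(["d1", "d2", "d3"], docs, weights, k=3))  # ['d1', 'd3', 'd2']

For the set-selection version of such coverage-style objectives, a greedy construction of this kind is the standard route to a constant-factor (1 - 1/e) approximation guarantee for the underlying NP-hard problem, which is the type of lower bound on approximate algorithms that the abstract refers to.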