A framework for evaluation and optimization of relevance and novelty-based retrieval

There has been growing interest in building and optimizing retrieval systems with respect to the relevance and novelty of information, which together more realistically reflect the usefulness of a system as perceived by the user. How to combine these criteria into a single metric that can be used to measure as well as optimize retrieval systems is an open challenge that has so far received only partial solutions. Unlike relevance, which can be measured independently for each document, the novelty of a document depends on the other documents the user has seen during his or her past interaction with the system. This is especially problematic for assessing retrieval performance across multiple ranked lists, as well as for learning from the user's feedback, which must be interpreted with respect to the other documents the user has seen. Moreover, users often have different tolerances for redundancy depending on the nature of their information needs and the time available, but this factor is not explicitly modeled by existing approaches to novelty-based retrieval.

In this thesis, we develop a new framework for evaluating as well as optimizing retrieval systems with respect to their utility, measured in terms of the relevance and novelty of information. We combine a nugget-based model of utility with a probabilistic model of user behavior; this leads to a flexible metric that generalizes existing evaluation measures. We demonstrate that our framework naturally extends to the evaluation of session-based retrieval while maintaining a consistent definition of novelty across multiple ranked lists.

Next, we address the complementary problem of optimization, i.e., how to maximize retrieval performance for one or more ranked lists with respect to the proposed measure. Since the system does not know which nuggets are relevant to each query, we propose a ranking approach that uses observable query and document features (e.g., words and named entities) as surrogates for the unknown nuggets; the weights of these features are learned automatically from user feedback. However, finding the ranked list that maximizes the coverage of a given set of nuggets is an NP-hard problem. We exploit the submodularity of the proposed measure to derive lower bounds on the performance of approximate algorithms, and we also conduct experiments to assess the empirical performance of a greedy algorithm under various conditions.

Our framework provides a strong foundation for modeling retrieval performance in terms of the non-independent utility of documents across multiple ranked lists. Moreover, it allows accurate evaluation and optimization of retrieval systems under realistic conditions, and hence enables rapid development and tuning of new algorithms for novelty-based retrieval without the need for user-centric evaluations involving human subjects, which, although more realistic, are expensive, time-consuming, and risky in a live environment.
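To make the flavor of such a metric concrete, here is one plausible instantiation of a nugget-based expected-utility measure combined with a rank-biased user model; the symbols and exact functional form are illustrative assumptions, not the thesis's definitive formula:

    U(d_1, \dots, d_K) \;=\; \sum_{k=1}^{K} p^{\,k-1} \sum_{g \in \mathcal{G}} w_g \, \gamma^{\, n_g(k)} \, \mathbf{1}\{\, g \in N(d_k) \,\}

Here N(d_k) is the set of nuggets contained in document d_k, w_g is the weight (importance) of nugget g, p is the probability that the user continues from one rank to the next, and n_g(k) counts how many times g has already appeared among d_1, ..., d_{k-1}. The parameter \gamma \in [0, 1] encodes the user's tolerance for redundancy: \gamma = 1 reduces to a purely relevance-based graded metric, while \gamma = 0 credits each nugget only the first time it is shown.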

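The optimization side can be illustrated with a small sketch of the greedy strategy the abstract alludes to: repeatedly append the document with the largest marginal gain in expected utility. This is a minimal sketch under the illustrative utility above; the names expected_utility, greedy_ranking, persistence, and redundancy_tolerance are hypothetical and do not come from the thesis.

    import math
    from typing import Dict, List, Set

    def expected_utility(ranking: List[str],
                         doc_nuggets: Dict[str, Set[str]],
                         nugget_weights: Dict[str, float],
                         persistence: float = 0.8,
                         redundancy_tolerance: float = 0.5) -> float:
        # Expected utility of a ranked list: the probability that the user reaches
        # rank k (geometric in `persistence`) times the gain of document d_k, where
        # a nugget's gain decays by `redundancy_tolerance` each time it reappears.
        seen: Dict[str, int] = {}
        utility = 0.0
        for rank, doc in enumerate(ranking):
            reach_prob = persistence ** rank
            nuggets = doc_nuggets.get(doc, set())
            gain = sum(nugget_weights.get(g, 0.0) * redundancy_tolerance ** seen.get(g, 0)
                       for g in nuggets)
            utility += reach_prob * gain
            for g in nuggets:
                seen[g] = seen.get(g, 0) + 1
        return utility

    def greedy_ranking(candidates: List[str],
                       doc_nuggets: Dict[str, Set[str]],
                       nugget_weights: Dict[str, float],
                       k: int = 10) -> List[str]:
        # Greedily append the candidate with the largest marginal utility gain.
        ranking: List[str] = []
        remaining = list(candidates)
        while remaining and len(ranking) < k:
            base = expected_utility(ranking, doc_nuggets, nugget_weights)
            best_doc, best_gain = None, -math.inf
            for doc in remaining:
                gain = expected_utility(ranking + [doc], doc_nuggets, nugget_weights) - base
                if gain > best_gain:
                    best_doc, best_gain = doc, gain
            ranking.append(best_doc)
            remaining.remove(best_doc)
        return ranking

    # Tiny usage example with made-up documents and nugget weights.
    docs = {"d1": {"A", "B"}, "d2": {"A"}, "d3": {"C"}}
    weights = {"A": 1.0, "B": 0.5, "C": 0.8}
    print(greedy_ranking(["d1", "d2", "d3"], docs, weights, k=3))  # ['d1', 'd3', 'd2']

For the set-selection version of such coverage-style objectives, a greedy construction of this kind is the standard route to a constant-factor (1 - 1/e) approximation guarantee for the underlying NP-hard problem, which is the type of lower bound on approximate algorithms that the abstract refers to.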