Optimizing the cost of information retrieval test collections

We consider the problem of optimally allocating limited resources to construct relevance judgements for a test collection that facilitates reliable evaluation of retrieval systems. We assume a large set of test queries, each with a large number of documents that need to be judged, although the available budget permits judging only a subset of them. A candidate solution to this problem has to deal with at least three challenges. (i) Given a fixed budget, it has to efficiently select a subset of query-document pairs for which to acquire relevance judgements. (ii) With the collected relevance judgements, it has to not only accurately evaluate the set of systems participating in the construction of the test collection but also reliably assess the performance of new, as yet unseen systems. (iii) Finally, it has to deal properly with uncertainty that is due to (a) the presence of unjudged documents in a ranked list, (b) the presence of queries with no relevance judgements, and (c) errors made by human assessors when labelling documents. In this thesis we propose an optimisation framework that accommodates appropriate solutions for each of the three challenges. Our approach is intended to benefit the construction of IR test collections by research institutes, e.g. NIST, or commercial search engines, e.g. Google and Bing, where there are large-scale document collections and abundant query logs but economic constraints prohibit gathering comprehensive relevance judgements.