A simple and efficient sampling method for estimating AP and NDCG

We consider the problem of large scale retrieval evaluation. Recently two methods based on random sampling were proposed as a solution to the extensive effort required to judge tens of thousands of documents. While the first method proposed by Aslam et al. [1] is quite accurate and efficient, it is overly complex, making it difficult to be used by the community, and while the second method proposed by Yilmaz et al., infAP [14], is relatively simple, it is less efficient than the former since it employs uniform random sampling from the set of complete judgments. Further, none of these methods provide confidence intervals on the estimated values. The contribution of this paper is threefold: (1) we derive confidence intervals for infAP, (2) we extend infAP to incorporate nonrandom relevance judgments by employing stratified random sampling, hence combining the efficiency of stratification with the simplicity of random sampling, (3) we describe how this approach can be utilized to estimate nDCG from incomplete judgments. We validate the proposed methods using TREC data and demonstrate that these new methods can be used to incorporate nonrandom samples, as were available in TREC Terabyte track '06.