Approximate Recall Confidence Intervals

Recall, the proportion of relevant documents retrieved, is an important measure of effectiveness in information retrieval, particularly in the legal, patent, and medical domains. Where document sets are too large for exhaustive relevance assessment, recall can be estimated by assessing a random sample of documents, but an indication of the reliability of this estimate is also required. In this article, we examine several methods for estimating two-tailed recall confidence intervals. We find that the normal approximation in current use provides poor coverage in many circumstances, even when adjusted to correct its inappropriate symmetry. Analytic and Bayesian methods based on the ratio of binomials are generally more accurate, but lose accuracy on small populations. The method we recommend derives beta-binomial posteriors on retrieved and unretrieved yield, with fixed hyperparameters, and uses a Monte Carlo estimate of the posterior distribution of recall. We demonstrate that this method gives mean coverage at or near the nominal level across several scenarios, while being balanced and stable. We offer advice on sampling design, including the allocation of assessments to the retrieved and unretrieved segments, and compare the proposed beta-binomial intervals with the officially reported normal intervals for recent TREC Legal Track iterations.
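The recommended approach can be illustrated with a minimal sketch. It is not the article's implementation: the function names, the hyperparameter values (alpha = beta = 0.5), the number of draws, and the example counts are all assumptions made here for illustration, and each segment is assumed to be sampled by simple random sampling without replacement. A beta-binomial draw is produced in the standard way, by drawing a prevalence from the beta posterior and then a binomial count for the unsampled documents.

```python
import random

def draw_yield(rng, seg_size, n_sampled, n_relevant, alpha, beta):
    """One posterior draw of the number of relevant documents in a segment."""
    # Beta posterior on the segment's prevalence, given the assessed sample.
    p = rng.betavariate(alpha + n_relevant, beta + n_sampled - n_relevant)
    # Binomial draw for the relevant documents among the unsampled remainder;
    # beta followed by binomial gives a beta-binomial draw.
    unseen = sum(rng.random() < p for _ in range(seg_size - n_sampled))
    return n_relevant + unseen

def recall_interval(retrieved, unretrieved, alpha=0.5, beta=0.5,
                    level=0.95, draws=2000, seed=0):
    """Monte Carlo two-tailed interval on recall.

    retrieved and unretrieved are (segment size, sample size, relevant in
    sample) triples. alpha and beta are assumed hyperparameter values.
    """
    rng = random.Random(seed)
    recalls = []
    for _ in range(draws):
        r = draw_yield(rng, *retrieved, alpha, beta)
        u = draw_yield(rng, *unretrieved, alpha, beta)
        if r + u > 0:  # recall is undefined when no relevant documents exist
            recalls.append(r / (r + u))
    recalls.sort()
    tail = (1 - level) / 2
    lower = recalls[int(tail * (len(recalls) - 1))]
    upper = recalls[int((1 - tail) * (len(recalls) - 1))]
    return lower, upper

# Hypothetical counts: 200 retrieved documents (50 assessed, 30 relevant)
# and 800 unretrieved documents (100 assessed, 5 relevant).
lower, upper = recall_interval((200, 50, 30), (800, 100, 5))
print(f"95% recall interval: [{lower:.3f}, {upper:.3f}]")
```

Because the interval is read off the empirical quantiles of the simulated recall values, it need not be symmetric around the point estimate, which is the property the normal approximation lacks.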
