Assessing the significance of performance differences on the PASCAL VOC challenges via bootstrapping

In the PASCAL VOC challenges, entrants in a particular competition are evaluated in terms of a specified metric. Some entrants may obtain similar scores, and it is then of interest to assess whether the differences between them are significant. For example, we might wish to know whether the highest-scoring entry is significantly better than some of the others. In this note we discuss the use of bootstrap sampling to address this question. We first came across the idea of bootstrapping precision-recall curves in the blog comment by O’Connor (2010), although bootstrapping of ROC curves has been discussed by many authors, e.g. Hall et al. (2004); Bertail et al. (2009). In the bootstrap (see e.g. Wasserman, 2004, Ch. 8), the data points (images in our case) are sampled with replacement from the original n test points to produce B bootstrap replicates. To compare two methods A and