A Measure for Evaluating Retrieval Techniques based on Partially Ordered Ground Truth Lists

For the RISM A/II collection of musical incipits (short extracts of scores, taken from the beginning), we have established a ground truth based on the opinions of human experts. It contains correctly ranked matches for a set of given queries. These ranked lists contain groups of documents whose ranks were not significantly different; in other words, the lists are only partially ordered. To make use of the available information for measuring the quality of retrieval results, we introduce the "average dynamic recall" (ADR), which averages the recall over a dynamic set of relevant documents, taking into account the fact that the ground truth reliably orders groups of matches, but not always individual matches. Dynamic recall measures how many of the documents that should have appeared at or before a given position in the result list actually have appeared. ADR at a given position averages this measure over all positions up to and including the given position. Our measure was first used at the MIREX 2005 Symbolic Melodic Similarity contest.
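
The description above can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not a reference implementation. The function name `average_dynamic_recall`, its parameters, and the example document identifiers are hypothetical. The sketch also assumes that the ground truth is given as a list of groups in rank order, that the documents allowed at position i are all members of the groups up to and including the group holding the i-th ground-truth document, that dynamic recall at position i is normalized by i, and that the result list contains no duplicates.

```python
from typing import Hashable, Optional, Sequence, Set


def average_dynamic_recall(
    groups: Sequence[Set[Hashable]],
    result: Sequence[Hashable],
    n: Optional[int] = None,
) -> float:
    """Average dynamic recall over positions 1..n (illustrative sketch).

    groups: ground truth as groups of documents in rank order; documents
            inside one group are treated as not reliably ordered.
    result: ranked list returned by the retrieval method (no duplicates).
    n:      last position to average over; defaults to the total number
            of ground-truth documents (an assumption of this sketch).
    """
    total = sum(len(g) for g in groups)
    if n is None:
        n = total
    if not 1 <= n <= total:
        raise ValueError("n must lie between 1 and the ground-truth size")

    allowed: Set[Hashable] = set()  # documents that may appear by position i
    covered = 0                     # ground-truth positions covered so far
    next_group = 0
    recall_sum = 0.0

    for i in range(1, n + 1):
        # Grow the allowed set until it includes the whole group that
        # contains the i-th ground-truth position.
        while covered < i:
            allowed |= set(groups[next_group])
            covered += len(groups[next_group])
            next_group += 1
        # Dynamic recall at i: share of the first i results that are allowed.
        hits = sum(1 for doc in result[:i] if doc in allowed)
        recall_sum += hits / i

    return recall_sum / n


# Hypothetical ground truth: "a" ranks first, "b" and "c" are tied, "d" is last.
ground_truth = [{"a"}, {"b", "c"}, {"d"}]
adr_tie_swapped = average_dynamic_recall(ground_truth, ["a", "c", "b", "d"])
adr_group_violated = average_dynamic_recall(ground_truth, ["b", "a", "c", "d"])
```

Under these assumptions, swapping the tied documents "b" and "c" is not penalized, whereas placing "b" ahead of "a" lowers the score, because "b" is not yet among the allowed documents at position 1.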