Minimal test collections for low-cost evaluation of Audio Music Similarity and Retrieval systems

Reliable evaluation of Information Retrieval systems requires large amounts of relevance judgments. Producing these annotations is not only tedious but, for many Music Information Retrieval tasks, also complex, so such evaluations often demand a prohibitive amount of effort. A low-cost alternative is the application of Minimal Test Collections algorithms, which provide highly reliable results while significantly reducing the required annotation effort. The idea is to represent effectiveness scores as random variables that can be estimated, iteratively selecting which documents to judge so that accurate estimates can be computed, with a given level of confidence, from the fewest possible judgments. In this paper we apply Minimal Test Collections to the evaluation of the Audio Music Similarity and Retrieval task, run by the annual MIREX evaluation campaign. An analysis of the MIREX 2007, 2009, 2010 and 2011 data shows that with as little as 2% of the total judgments we can obtain accurate estimates of the ranking of systems. We also present a method to rank systems without making any annotations at all, which can be used successfully when few or no resources are available.
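
The estimation loop behind Minimal Test Collections is easy to sketch: each unjudged document's relevance is modeled as a Bernoulli random variable, the difference in effectiveness between two systems becomes a weighted sum of those variables, and the next document sent to an annotator is the one contributing the most variance to that sum, stopping once the sign of the difference is known with enough confidence. The following minimal Python sketch illustrates this for two systems; the binary-gain DCG@K measure, the synthetic runs, the 0.5 prior and judge_oracle() are illustrative assumptions only, not the paper's exact setup (the AMS task uses graded similarity judgments and other measures).

```python
# Minimal sketch of the MTC estimation loop for comparing two systems,
# assuming binary relevance and a binary-gain DCG@K measure; the runs,
# the 0.5 prior and judge_oracle() are hypothetical stand-ins.
import math
import random

K = 10             # evaluation cutoff
PRIOR = 0.5        # Bernoulli prior for the relevance of unjudged documents
CONFIDENCE = 0.95  # stop once the sign of the difference is this certain

# Hypothetical ranked lists of document ids returned by systems A and B.
run_a = [f"d{i}" for i in range(100)]
run_b = sorted(run_a, key=lambda _: random.random())

def discount(run, doc):
    """Rank discount of doc within run's top K (0 if not retrieved there)."""
    return 1.0 / math.log2(run.index(doc) + 2) if doc in run[:K] else 0.0

# Delta = DCG@K(A) - DCG@K(B) = sum_d w_d * rel_d, where rel_d is Bernoulli
# for unjudged documents; only documents with nonzero weight matter.
weights = {d: discount(run_a, d) - discount(run_b, d)
           for d in set(run_a[:K]) | set(run_b[:K])}
weights = {d: w for d, w in weights.items() if w != 0.0}

judged = {}  # document id -> 0/1 relevance judgment

def judge_oracle(doc):
    """Stand-in for a human annotator; replace with real judgments."""
    return int(random.random() < 0.4)

def sign_confidence():
    """Normal-approximation confidence in the sign of the estimated Delta."""
    mean = sum(w * judged.get(d, PRIOR) for d, w in weights.items())
    var = sum(w * w * PRIOR * (1.0 - PRIOR)
              for d, w in weights.items() if d not in judged)
    if var == 0.0:
        return 1.0 if mean != 0.0 else 0.5
    p = 0.5 * (1.0 + math.erf(mean / math.sqrt(2.0 * var)))  # P(Delta > 0)
    return max(p, 1.0 - p)

# Judge the unjudged document that contributes the most variance to Delta,
# repeating until the estimated ranking of A vs. B is confident enough.
while sign_confidence() < CONFIDENCE:
    pending = [d for d in weights if d not in judged]
    if not pending:
        break
    doc = max(pending, key=lambda d: weights[d] ** 2)
    judged[doc] = judge_oracle(doc)

print(f"judged {len(judged)} of {len(weights)} weighted documents; "
      f"sign confidence {sign_confidence():.3f}")
```

In practice the loop typically stops well before all weighted documents are judged, which is the source of the cost savings reported in the paper; extending the sketch to a full ranking of many systems means maintaining one such estimate per system pair.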
