On Inter-rater Agreement in Audio Music Similarity

One of the central tasks in the annual MIREX evaluation campaign is the "Audio Music Similarity and Retrieval" (AMS) task. Songs that algorithms rank as highly similar are graded by human evaluators according to their subjective judgment of similarity. By analyzing results from the AMS tasks of the years 2006 to 2013, we demonstrate that: (i) due to low inter-rater agreement, there is an upper bound on the performance achievable in terms of subjective gradings; (ii) this upper bound was already reached by participating algorithms in 2009 and has not been surpassed since. Based on this sobering result, we discuss ways to improve future evaluations of audio music similarity.
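To make the link between rater disagreement and a performance ceiling concrete, the following is a minimal sketch, not the procedure used in the analysis itself. It assumes two hypothetical graders who score the same query/candidate pairs on an assumed 0 to 100 scale; the data, the threshold, and the function names `pearson` and `agreement_ceiling` are all illustrative. The idea is that if one grader's top-rated pairs receive only moderate scores from a second grader, an algorithm that perfectly matches the first grader can expect no more than that moderate score from the second.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def agreement_ceiling(reference, other, threshold):
    """Mean score given by `other` to the pairs that `reference`
    rated at or above `threshold` (hypothetical ceiling estimate)."""
    picked = [o for r, o in zip(reference, other) if r >= threshold]
    if not picked:
        raise ValueError("no pairs at or above the threshold")
    return sum(picked) / len(picked)


# Hypothetical scores from two graders for the same six song pairs.
grader_a = [90, 80, 20, 95, 40, 10]
grader_b = [60, 85, 30, 70, 55, 5]

r = pearson(grader_a, grader_b)
ceiling = agreement_ceiling(grader_a, grader_b, threshold=80)

print(f"correlation between graders: {r:.2f}")
print(f"mean score from grader B on pairs grader A rated >= 80: {ceiling:.2f}")
```

In this toy data, the pairs that grader A considers highly similar average well below the top of the scale when judged by grader B, which illustrates why imperfect agreement limits the score any algorithm could plausibly obtain.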