Investigating the Perceptual Validity of Evaluation Metrics for Automatic Piano Music Transcription

Automatic Music Transcription (AMT) is usually evaluated using low-level criteria, typically by counting errors and weighting them all equally. Yet some errors (e.g. out-of-key notes) are more salient than others. In this study, we design an online listening test to gather judgements about AMT quality. These judgements take the form of pairwise comparisons between transcriptions of the same music produced by two different AMT systems. We investigate how these judgements correlate with benchmark metrics, and find that although they match in many cases, agreement drops when comparing pairs with similar scores, or pairs of poor transcriptions. We show that onset-only notewise F-measure is the benchmark metric that correlates best with human judgement, and that the correlation improves further with higher onset tolerance thresholds. We define a set of features related to various musical attributes, and use them to design a new metric that correlates significantly better with listeners’ quality judgements. We examine which musical aspects were important to raters by conducting an ablation study on the defined metric, highlighting the importance of the rhythmic dimension (tempo, meter). We make the collected data fully available for further study, in particular to evaluate the perceptual relevance of new AMT metrics.
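
To make the onset-only notewise F-measure and its tolerance threshold concrete, the following is a minimal sketch and not the benchmark implementation used in the study. It assumes notes are given as (onset in seconds, MIDI pitch) pairs, and it matches each reference note to at most one estimated note of the same pitch whose onset lies within the tolerance. The function name `onset_f_measure`, the parameter `onset_tol`, and the toy note lists are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def onset_f_measure(ref_notes, est_notes, onset_tol=0.05):
    """Return (precision, recall, F) for onset-only notewise matching.

    ref_notes, est_notes: lists of (onset_seconds, midi_pitch) tuples.
    A reference note and an estimated note match when they share the same
    pitch and their onsets differ by at most onset_tol seconds. Each note
    is used in at most one match.
    """
    ref_by_pitch = defaultdict(list)
    est_by_pitch = defaultdict(list)
    for onset, pitch in ref_notes:
        ref_by_pitch[pitch].append(onset)
    for onset, pitch in est_notes:
        est_by_pitch[pitch].append(onset)

    matched = 0
    for pitch, ref_onsets in ref_by_pitch.items():
        ref_onsets = sorted(ref_onsets)
        est_onsets = sorted(est_by_pitch.get(pitch, []))
        i = j = 0
        # Two-pointer sweep over the sorted onsets of a single pitch.
        while i < len(ref_onsets) and j < len(est_onsets):
            diff = est_onsets[j] - ref_onsets[i]
            if abs(diff) <= onset_tol:
                matched += 1
                i += 1
                j += 1
            elif diff < 0:
                j += 1  # estimated onset is too early to match anything later
            else:
                i += 1  # reference onset is too early to match anything later

    precision = matched / len(est_notes) if est_notes else 0.0
    recall = matched / len(ref_notes) if ref_notes else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


# Toy data (hypothetical): one note is 80 ms late, one has the wrong pitch.
ref = [(0.0, 60), (0.5, 64), (1.0, 67)]
est = [(0.02, 60), (0.58, 64), (1.0, 66)]

strict = onset_f_measure(ref, est, onset_tol=0.05)
lenient = onset_f_measure(ref, est, onset_tol=0.10)
print(strict, lenient)
```

In this toy case the late note counts as an error at a 50 ms tolerance but as a correct note at 100 ms, so the score rises with the tolerance, while the wrong-pitch note is penalised at both settings. This only illustrates how the threshold changes what is counted; it says nothing by itself about which setting agrees better with listeners.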
