Evaluating Optical Music Recognition (OMR) has long been an acknowledged sore spot of the field. This short position paper attempts to bring some clarity to which problems in OMR evaluation are actually open: a closer look reveals that the main problem is finding an edit distance between practical representations of music scores. While estimating these editing costs in the transcription use-case of OMR is difficult, I argue that the problems of modeling the subsequent editing workflow can be de-coupled from general OMR system development using an intrinsic evaluation approach, and sketch out how to do this.

I. WE NEED A MUSIC SCORE EDIT DISTANCE

Optical Music Recognition (OMR) has a known problem with evaluation [1]–[3]. We can approach OMR evaluation from two angles: extrinsic and intrinsic. By extrinsic, we mean evaluation in application contexts: how well does an OMR system address a specific need (such as retrieval, transcription, playback, ...)? Intrinsic evaluation asks a different question: how much of the information encoded by the music score has a given OMR system recovered? An example of extrinsic OMR evaluation can be found, for example, in [4], where OMR is evaluated in the context of a cross-modal retrieval system; (partial) intrinsic evaluation is done, inter alia, in [5], where pitches and durations of recognized notes are counted against ground truth data. In this short position paper, I assess what the outstanding problems in evaluating OMR are, and propose intrinsic evaluation as a sensible way forward for OMR research.

The major problem in OMR evaluation is that, given a ground truth encoding of a score and the output of a recognition system, there is no automatic method for reliably computing how well the recognition system performed that would (1) be rigorously described and evaluated, (2) have a public implementation, and (3) give meaningful results. Other applications such as retrieval or extracting MIDI can be evaluated using more general methodologies. For example, when OMR is used to retrieve music scores, there is little that is domain-specific about defining success compared to retrieving other documents; whenever MIDI output is required, metrics used to evaluate multi-f0 estimation can be adapted; and score following has well-defined evaluation metrics at different levels of granularity as well. Within the traditional OMR pipeline [6], the partial steps (such as symbol detection) can likewise use more general evaluation metrics. However, when OMR is applied to retypesetting music (arguably its original motivation), no evaluation metric is available. In fact, computing an “edit distance” between a ground truth representation of a full music score and OMR output may be the only evaluation scenario for which satisfactory measures are not available. The notion of “edit cost” [7] or “recognition gain” [8], which defines success in terms of how much time a human editor saves by using an OMR system, is yet more problematic, as it depends on the specific toolchain used.

What can be done? One can try to implement such a metric. However, because cost-to-correct depends on the toolchains music editors use to work with OMR outputs, developing extrinsic evaluation metrics of OMR for transcription would require user studies at a scale that is not feasible for the few active OMR researchers. For these reasons, I argue it would be helpful for OMR development to have an intrinsic evaluation metric. After all, why address the individual concerns that OMR users may have, when full-pipeline OMR has the potential to address all of OMR's application scenarios, since it attempts to extract all the information available from a music score?
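Before moving on, it helps to make concrete what an edit distance between score representations computes in the simplest case. The following sketch is a toy illustration only: it flattens a score into a sequence of (pitch, duration) tokens and applies a standard Levenshtein distance with unit costs. The NoteToken class and the cost values are assumptions made for this example, not an existing OMR metric; precisely because it ignores voices, layout, and everything else that makes correcting OMR output expensive, such a naive distance fails requirement (3) above.

from dataclasses import dataclass

@dataclass(frozen=True)
class NoteToken:
    """Hypothetical flattened note event: MIDI pitch and duration in quarter lengths."""
    pitch: int
    duration: float

def score_edit_distance(gt, pred, sub_cost=1.0, indel_cost=1.0):
    """Levenshtein-style distance between two flattened note sequences.

    Treating a score as a plain token sequence discards voices, layout,
    and most notation semantics -- exactly the information a useful OMR
    edit distance would have to account for.
    """
    n, m = len(gt), len(pred)
    # dp[i][j] = cost of transforming gt[:i] into pred[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0.0 if gt[i - 1] == pred[j - 1] else sub_cost
            dp[i][j] = min(
                dp[i - 1][j - 1] + match,   # substitute (or keep) a note
                dp[i - 1][j] + indel_cost,  # delete a ground-truth note
                dp[i][j - 1] + indel_cost,  # insert a spurious note
            )
    return dp[n][m]

# Toy usage: one wrong pitch and one missing note cost 2.0 with unit costs.
gt = [NoteToken(60, 1.0), NoteToken(62, 0.5), NoteToken(64, 0.5)]
pred = [NoteToken(60, 1.0), NoteToken(63, 0.5)]
print(score_edit_distance(gt, pred))  # 2.0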
II. MUSIC NOTATION FORMATS ARE PROBLEMATIC

Part of the edit distance problem lies in the ways music notation is stored digitally. MusicXML and MEI, which represent current best practice in open formats for digitally representing music scores, have properties that make it difficult to compute a useful edit distance between two such files (useful in the sense that it would measure either the amount of errors an OMR system made, or the actual difficulty of changing one score into the other). First, the formats can encode the same score in multiple ways – e.g., MusicXML stores the same score either measure-wise (score-timewise) or part-wise (score-partwise). Next, both formats are designed top-down, as trees whose nodes represent both abstract concepts, like a voice or a note, and graphical entities, such as stems or beams. This implies that they cannot represent partial recognition results and cannot encode syntactically incorrect notation. Furthermore, while the hierarchical structure mostly reflects the abstract structures of music such as voices and measures, it does not reflect the structure of music notation: local changes in the score can lead to several changes in the encoding that occur far apart, and vice versa. This is an inherent limitation of their tree structure. The LilyPond format is impractical for anything but attempts at end-to-end OMR, as it hides much of the graphical representation in its engraving engine and has so many ways of representing the same music that it is hard to meaningfully compare LilyPond files. The MuNG format [3] does to some extent overcome the locality problem by assuming a directed acyclic graph instead of a tree structure, but it is limited to OMR ground truth and lacks conversions to formats other than MIDI.

The lesson here is that one should not bind intrinsic OMR evaluation to specific notation formats. After all, these formats change much faster than music notation itself. Rather, an evaluation metric should focus on inherent properties of music notation.

III. ARGUING FOR INTRINSIC EVALUATION

Intrinsic evaluation of OMR systems means answering the question “How good is this system?” without having to add “for this specific purpose” – thus de-coupling research on OMR methods from their individual use-cases, including the problematic score transcription. After all, music notation is the same regardless of whether it is being recognized for the purpose of searching a database or for producing a digital edition of the score. There is no reason why this should not be possible: there is a finite amount of information that a music document carries, and it can be exhaustively enumerated. It follows that we should be able to measure what proportion of this information our systems recover correctly.

The benefits of intrinsic evaluation would be shedding the burden of accounting for score editing toolchains, independence from the problematic music notation formats used in broad practice, and a clearly interpretable automatic metric for guiding OMR development (one potentially usable as a differentiable loss function for training full-pipeline, end-to-end machine learning systems).
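To make the counting view of intrinsic evaluation concrete, the sketch below computes, per attribute, what fraction of ground-truth note information an OMR output has recovered, in the spirit of the pitch and duration counts of [5]. The attribute names and the assumption that predicted notes have already been aligned one-to-one with ground-truth notes (with None marking missed notes) are simplifications introduced for this example; producing that alignment, and extending the enumerated information beyond note-level attributes to everything a score encodes, is where the actual research effort lies.

from collections import Counter

def attribute_recovery_rate(aligned_pairs, attributes=("pitch", "onset", "duration")):
    """Fraction of ground-truth attribute values recovered by the OMR output.

    `aligned_pairs` is assumed to be a list of (gt_note, pred_note) pairs,
    where pred_note is None for ground-truth notes the system missed and
    each note is a dict of attribute values. The alignment itself is a
    non-trivial subproblem; only the counting step is illustrated here.
    """
    correct, total = Counter(), Counter()
    for gt_note, pred_note in aligned_pairs:
        for attr in attributes:
            total[attr] += 1
            if pred_note is not None and pred_note.get(attr) == gt_note.get(attr):
                correct[attr] += 1
    return {attr: correct[attr] / total[attr] for attr in attributes if total[attr]}

# Toy usage: one note fully recovered, one with a wrong duration, one missed entirely.
pairs = [
    ({"pitch": 60, "onset": 0.0, "duration": 1.0}, {"pitch": 60, "onset": 0.0, "duration": 1.0}),
    ({"pitch": 62, "onset": 1.0, "duration": 0.5}, {"pitch": 62, "onset": 1.0, "duration": 1.0}),
    ({"pitch": 64, "onset": 1.5, "duration": 0.5}, None),
]
print(attribute_recovery_rate(pairs))  # pitch and onset recovered for 2/3 of notes, duration for 1/3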
[1] Mariusz Szwoch et al., “Using MusicXML to Evaluate Accuracy of OMR Systems,” Diagrams, 2008.
[2] Jaroslav Pokorný et al., “Further Steps Towards a Standard Testbed for Optical Music Recognition,” ISMIR, 2016.
[3] Meinard Müller et al., “Matching Musical Themes based on noisy OCR and OMR input,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[4] Pierfrancesco Bellini et al., “Assessing Optical Music Recognition Tools,” Computer Music Journal, 2007.
[5] Pavel Pecina et al., “The MUSCIMA++ Dataset for Handwritten Optical Music Recognition,” 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017.
[6] Jakob Grue Simonsen et al., “Towards a Standard Testbed for Optical Music Recognition: Definitions, Metrics, and Page Images,” 2015.
[7] Carlos Guedes et al., “Optical music recognition: state-of-the-art and open issues,” International Journal of Multimedia Information Retrieval, 2012.
[8] Kia Ng et al., “Improving OMR for Digital Music Libraries with Multiple Recognisers and Multiple Sources,” DLfM '14, 2014.
[9] Jürgen Schmidhuber et al., “DeepScores - A Dataset for Segmentation, Detection and Classification of Tiny Objects,” 24th International Conference on Pattern Recognition (ICPR), 2018.