Human-Directed Optical Music Recognition

We propose a human-in-the-loop scheme for optical music recognition. Starting from the results of our recognition engine, we pose the problem as one of constrained optimization, in which the human can specify various pixel labels while our recognition engine seeks an optimal explanation subject to the human-supplied constraints. In this way we enable an interactive approach with a uniform communication channel from human to machine, where both iterate their roles until the desired end is achieved. Pixel constraints may be added at various stages, including staff finding, system identification, and measure recognition. Results on a test set show a significant speed-up compared to purely human-driven correction.

Introduction

Optical Music Recognition (OMR) holds the potential to transform score images into symbolic music libraries, thus enabling search, categorization, and retrieval by symbolic content, as we now take for granted with text. Such symbolic libraries would serve as the foundation for the emerging field of computational musicology, and provide data for a wide variety of fusions between music, computer science, and statistics. Equally exciting are applications such as the digital music stand, and systems that support practice and learning through objective analysis of rhythm and pitch. In spite of this promise, progress in OMR has been slow; even the best systems, both commercial and academic, leave much to be desired [7]. In many cases the effort needed to correct OMR output may be more than that of entering the music data from scratch [8]. In such cases OMR systems fail to make any meaningful contribution at all. The reason for these disappointing results is simply that OMR is hard. Bainbridge [17] discusses some of the challenges of OMR that impede its development.
One central problem is that music notation contains a large variety of somewhat-rare musical symbols and conventions [4], such as articulations, bowings, tremolos, fingerings, accents, harmonics, stops, repeat marks, 1st and 2nd endings, dal segno and da capo markings, trills, mordents, turns, breath marks, etc. While one can easily build recognizers that accommodate these somewhat-unusual symbols and special notational cases, the false-positive detections that result often outweigh the additional correct detections they produce. Under some circumstances, even some not-so-rare symbols fall into this better-not-to-recognize category, such as augmentation dots, double sharps, and partial beams. Another issue arises from the difficulty of describing the high-level structure of music notation. Objects such as chords, beamed groups, and clef-key-signatures are highly structured and lend themselves naturally to grammatical representation; however, the overall organization of symbols within a measure is far less constrained. The OMR literature contains several efforts to formulate a unified grammar for music notation [10, 11]. These approaches represent grammars of primitive symbols (beams, flags, note heads, stems, etc.) and begin by assuming a collection of segmented primitives. While our grammars overlap significantly with these approaches, one of our primary uses for the grammar is the segmentation of the symbols into primitives: we do not believe it is realistic to identify the primitives without understanding the larger structures that contain them. Kopec [12] describes a compelling Markov Source Model for music recognition that simultaneously segments and recognizes. However, the approach addresses a small subset of music notation and does not generalize in any obvious way. In particular, our primary focus is on the International Music Score Library Project (IMSLP), and Kopec's model covers only a small minority of the examples encountered there.
Other difficulties stem from the kinds of image degradation encountered, including poor or variable contrast, skew and warping caused when the document is not aligned or flat on the scanner bed, hand-written marks, damage to pages, and other sources. Some recent research has been dedicated to improving fully automated OMR systems in a post-process fashion, or in other ways that leave the core recognition engine intact. These efforts either create systems that adapt automatically [16, 24], add musically meaningful constraints for recognition [1, 5], or combine multiple recognizers to achieve better accuracy [9, 7]. However, OMR research is still a long way from our shared goal of creating large-scale symbolic music databases. Hankinson et al. [15] created a prototype system for distributed large-scale OMR, which converts a collection of Gregorian chant scores into symbolic files to facilitate their in situ content-based retrieval, though the approach still requires a large amount of careful proofreading and correction.

In light of these many obstacles and our collective history, it seems unwise to bet on fully automated OMR systems producing high-quality results with any consistency. Instead we favor casting the problem as an interactive one, thus putting the human in the computational loop. In this case the essential challenge becomes one of minimizing the user's effort, putting as much burden as possible on the computer (but no more). There are many creative ways to integrate a person into the recognition pipeline, allowing her to correct, give hints, or direct the computation. This work constitutes an effort in this direction.

Our first attempt to bring the human into the OMR pipeline built a user interface allowing the correction of individual primitives: stem, beam, note head, single flag, sharp, augmentation dot, etc. Thus the user's task was simply to cover the image ink by adding and deleting appropriate primitives.
A benefit of this approach is that it presents the user with a clearly defined task that doesn't require knowledge of the system's inner workings. There are, however, several weaknesses: the human tagging process is laborious; it fails to provide important syntactic relations between primitives; it requires the person to precisely register each primitive with the image; and it allows the person to create uninterpretable configurations of primitives (say, a stem with no note head), creating havoc further down the OMR pipeline. Our aim here is to remedy all of these weaknesses while still presenting a simple task to the user.

Our current approach first presents the user with the original recognition results, obtained through fully automatic means. The user may then label any individual pixel according to the recognition task at hand. For instance, during system recognition the user may label a pixel as white space or bar line, while during measure recognition we use a richer collection of labels including closed, half, and whole note heads, stems, ledger lines, beams, sharps, single flags, etc. The system then re-recognizes subject to the user-imposed constraints. Since our recognizers embed highly restrictive assumptions on the primitives they assemble, a single correction often fixes a number of problems at once. Human and machine then iterate this process of providing and synthesizing human-supplied constraints into recognized results. This approach leaves the registration problem (the precise location of primitives) in the hands of the machine, where we believe it belongs. Furthermore, since our system can only recognize meaningful configurations of symbols, we avoid the problem of trying to assemble human-tagged composite symbols that may not make sense. While the resulting process may still be laborious, our results indicate that this strategy reduces the human burden considerably.
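Abstractly, this re-recognition step can be viewed as constrained optimization: the engine searches over candidate interpretations, each of which induces a labeling of the pixels it explains, and any candidate that contradicts a user-supplied label is discarded before the objective is maximized. The following is only a minimal illustrative sketch of that admissibility test; the names, data structures, and scoring function are hypothetical, and the actual engine embeds the constraints inside grammar-driven decoding rather than enumerating candidates.

```python
# Hypothetical sketch: recognition as constrained optimization, where
# each candidate interpretation maps pixel coordinates to labels and
# user constraints simply prune inconsistent candidates.

def consistent(interpretation, constraints):
    """An interpretation is admissible only if it reproduces every
    user-supplied (pixel -> label) constraint."""
    return all(interpretation.get(pixel) == label
               for pixel, label in constraints.items())

def recognize(candidates, score, constraints):
    """Return the highest-scoring interpretation satisfying all
    user-imposed pixel labels (None if every candidate is pruned)."""
    admissible = [c for c in candidates if consistent(c, constraints)]
    return max(admissible, key=score, default=None)

# Toy example: two readings of the same ink; the user's single pixel
# label overrides the engine's initial preference.
c1 = {(10, 42): "stem", (11, 42): "stem"}        # reads the ink as a stem
c2 = {(10, 42): "barline", (11, 42): "barline"}  # reads it as a bar line
score = lambda c: 0.9 if c is c1 else 0.8        # stand-in objective
constraints = {(10, 42): "barline"}              # user says: bar line

best = recognize([c1, c2], score, constraints)
print(best is c2)  # True: the constraint flips the winner
```

In practice the candidate set is never enumerated explicitly, but the admissibility criterion the search enforces is the same.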
Furthermore, there are many other ways of introducing human-specified constraints into the recognition process; thus the current effort constitutes an initial exploration of a longer-term goal.

Interactive OMR

Various authors, such as Rebelo [13], suggest that an interactive OMR system could be a realistic solution to the problem, though the central challenge of fusing the human and machine contributions remains open. Human-in-the-loop computation has received considerable attention recently [23]. It has been applied to a wide variety of areas, such as retrieval systems [19], object classification [20], character recognition [18], document indexing [25], image labeling [22], and fine-grained visual categorization [21]. Romero [26] proposed a Hidden Markov Model (HMM) for computer-assisted text transcription, in which the user-imposed prefix is used to constrain both the sequence decoding and the language priors. The potential of all these applications is summarized in von Ahn's statement [18]: "Human processing power can be harnessed to solve problems that computers cannot yet solve." Several OMR systems have already taken human-in-the-loop computation into account. For instance, Fujinaga [4] proposed an adaptive system that could incrementally improve its symbol classifiers based on human feedback. Church [6] implemented an interface accepting user feedback to guide misrecognized measures toward similar correct measures found elsewhere in the score. Our system uses human feedback in an entirely different manner: as a means of constraining the recognition process in a user-specified way, thus leveraging the user's input in the heart of the system. It is worth noting that our approach constitutes a generic framework that poses human-in-the-loop recognition as constrained optimization, applicable beyond the specific confines of OMR.

Human-Directed Recognition

As motivation, consider the example given in Figure 1.
Suppose our recognition misses the upper note head of the chord (Figure 1b). Then suppose the user labels a single pixel belonging to the missing note head as a solid head (Figure 1c). When the system re-recognizes subject to this constraint, the note head, its associated ledger line, accidental, and stem portion may all be recovered.
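This cascade follows from the grammar: fixing one pixel as a solid note head means every admissible parse must place a head there, and the grammar then obliges the parse to supply the head's supporting primitives as well. The sketch below is a deliberately simplified, hypothetical illustration of that entailment; the function, its geometry, and the one-ledger-line-per-staff-space rule are assumptions for exposition, not our engine's actual pitch-dependent layout model.

```python
# Hypothetical sketch: once a pixel is constrained to be a solid note
# head, grammatical structure entails further primitives that any
# admissible parse must also account for.

def implied_primitives(head_row, staff_top_row, staff_spacing):
    """Primitives entailed by a solid note head at head_row, for a
    staff whose top line is at staff_top_row (rows grow downward)."""
    primitives = ["solid_head"]
    # A head above the staff needs ledger lines back down to the top
    # line -- roughly one per staff space of separation in this toy.
    spaces_above = (staff_top_row - head_row) // staff_spacing
    primitives += ["ledger_line"] * max(spaces_above, 0)
    # In a stemmed chord the head must also join the shared stem.
    primitives.append("stem_extension")
    return primitives

# A head two staff spaces above the top line:
print(implied_primitives(head_row=50, staff_top_row=70, staff_spacing=10))
# -> ['solid_head', 'ledger_line', 'ledger_line', 'stem_extension']
```

This captures only the direction of entailment; the engine additionally re-estimates the precise registration of each implied primitive against the image ink.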

References

[1] Donald Byrd et al. Prospects for Improving OMR with Multiple Recognizers. ISMIR, 2006.
[2] Michael Scott Cuthbert et al. Improving Rhythmic Transcriptions via Probability Models Applied Post-OMR. ISMIR, 2014.
[3] Carlos Guedes et al. Optical music recognition: state-of-the-art and open issues. International Journal of Multimedia Information Retrieval, 2012.
[4] Manuel Blum et al. reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science, 2008.
[5] Ichiro Fujinaga et al. Creating a large-scale searchable digital collection from printed music materials. WWW, 2012.
[6] Christopher Raphael et al. Interpreting Rhythm in Optical Music Recognition. ISMIR, 2012.
[7] Pietro Perona et al. Visual Recognition with Humans in the Loop. ECCV, 2010.
[8] Timothy C. Bell et al. The Challenge of Optical Music Recognition. Computers and the Humanities, 2001.
[9] Kun Duan et al. Discovering localized attributes for fine-grained recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[10] Bruce W. Pennycook et al. Adaptive optical music recognition. 1997.
[11] Pierfrancesco Bellini et al. Assessing Optical Music Recognition Tools. Computer Music Journal, 2007.
[12] Ichiro Fujinaga et al. The Gamera framework for building custom recognition systems. 2003.
[13] David A. Maltz et al. Markov source model for printed music decoding. Electronic Imaging, 1995.
[14] Alejandro Héctor Toselli et al. Computer Assisted Transcription for Ancient Text Images. ICIAR, 2007.
[15] Ichiro Fujinaga et al. Correcting Large-Scale OMR Data with Crowdsourcing. DLfM, 2014.
[16] William A. Barrett et al. Intelligent indexing: a semi-automated, trainable system for field labeling. Electronic Imaging, 2015.
[17] Linn Saxrud Johansen. Optical Music Recognition. 2009.
[18] Donald Byrd et al. Towards Musicdiff: A Foundation for Improved Optical Music Recognition Using Multiple Recognizers. ISMIR, 2007.
[19] Carla E. Brodley et al. ASSERT: A Physician-in-the-Loop Content-Based Retrieval System for HRCT Image Databases. Computer Vision and Image Understanding, 1999.
[20] Christopher Raphael et al. Optical music recognition on the International Music Score Library Project. Electronic Imaging, 2013.
[21] Laura A. Dabbish et al. Labeling images with a computer game. AAAI Spring Symposium: Knowledge Collection from Volunteer Contributors, 2004.
[22] C. Brisset. Using Logic Programming Languages for Optical Music Recognition. 1995.
[23] Dorothea Blostein et al. A Graph-Rewriting Paradigm for Discrete Relaxation: Application to Sheet-Music Recognition. International Journal of Pattern Recognition and Artificial Intelligence, 1998.
[24] Christopher Raphael et al. New Approaches to Optical Music Recognition. ISMIR, 2011.
[25] Benjamin B. Bederson et al. Human computation: a survey and taxonomy of a growing field. CHI, 2011.
[26] Isabelle Bloch et al. Robust and Adaptive OMR System Including Fuzzy Modeling, Fusion of Musical Rules, and Possible Error Detection. EURASIP Journal on Advances in Signal Processing, 2007.