Using Cell Phone Pictures of Sheet Music To Retrieve MIDI Passages

This article investigates a cross-modal retrieval problem in which a user would like to retrieve a passage of music from a MIDI file by taking a cell phone picture of several lines of sheet music. This problem is challenging for two reasons: it has a significant runtime constraint since it is a user-facing application, and there is very little relevant training data containing cell phone images of sheet music. To solve this problem, we introduce a novel feature representation called a bootleg score, which encodes the position of noteheads relative to staff lines in sheet music. The MIDI representation can be converted into a bootleg score using deterministic rules of Western musical notation, and the sheet music image can be converted into a bootleg score using classical computer vision techniques for detecting simple geometric shapes. Once the MIDI file and the cell phone image have been converted into bootleg scores, we can estimate the alignment between them using dynamic programming. The most notable characteristic of our system is that it has no trainable weights at all, only a set of about 40 hyperparameters. With a training set of just 400 images, we show that our system generalizes well to a much larger set of 1600 test images from 160 unseen musical scores. Our system achieves a test F-measure of 0.89, has an average runtime of 0.90 seconds, and outperforms baseline systems based on music object detection and sheet–audio alignment. We provide extensive experimental validation and analysis of our system.
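The alignment step can be illustrated with a minimal sketch of subsequence dynamic time warping. It is an illustration under stated assumptions, not the system's actual implementation. Each bootleg-score column is modeled here as a set of active staff-position indices, the column cost is one minus the Jaccard overlap, and the allowed steps are the three basic ones (diagonal, vertical, horizontal). The abstract specifies none of these choices, and the names `column_cost` and `subsequence_dtw` are hypothetical.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Assumption: one bootleg-score column = set of staff positions holding a notehead.
Column = FrozenSet[int]


def column_cost(a: Column, b: Column) -> float:
    """Dissimilarity of two columns: 1 - Jaccard overlap (an assumed cost)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def subsequence_dtw(
    query: Sequence[Column], reference: Sequence[Column]
) -> Tuple[int, int, float]:
    """Return (start, end, cost) of the reference span best matching the query.

    The whole query must be matched, but the match may begin and end
    anywhere in the reference, which is what passage retrieval needs.
    """
    n, m = len(query), len(reference)
    if n == 0 or m == 0:
        raise ValueError("query and reference must be non-empty")

    inf = float("inf")
    # cost[i][j]: cheapest alignment of query[0..i] ending at reference[j].
    cost: List[List[float]] = [[inf] * m for _ in range(n)]
    # start[i][j]: reference index where that cheapest alignment begins.
    start: List[List[int]] = [[0] * m for _ in range(n)]

    for j in range(m):
        # The first query column may start at any reference column for free.
        cost[0][j] = column_cost(query[0], reference[j])
        start[0][j] = j

    for i in range(1, n):
        for j in range(m):
            candidates = [(cost[i - 1][j], start[i - 1][j])]  # vertical step
            if j > 0:
                candidates.append((cost[i - 1][j - 1], start[i - 1][j - 1]))  # diagonal
                candidates.append((cost[i][j - 1], start[i][j - 1]))  # horizontal
            best_cost, best_start = min(candidates)
            cost[i][j] = best_cost + column_cost(query[i], reference[j])
            start[i][j] = best_start

    end = min(range(m), key=lambda j: cost[n - 1][j])
    return start[n - 1][end], end, cost[n - 1][end]


# Toy example: the query is an exact copy of reference columns 2..4.
reference = [frozenset(s) for s in ({1}, {2}, {3, 5}, {4}, {6}, {7})]
query = [frozenset(s) for s in ({3, 5}, {4}, {6})]
print(subsequence_dtw(query, reference))  # (2, 4, 0.0)
```

Because the first row of the cost table is not accumulated, the query may match any passage of the reference without penalty for the material before or after it. A missed or spurious notehead in the image raises the cost of one column but, in this toy setting, does not move the recovered span.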
