Optical Music Recognition with Convolutional Sequence-to-Sequence Models

Optical Music Recognition (OMR) is an important technology within Music Information Retrieval. Deep learning models show promising results on OMR tasks, but symbol-level annotated data sets large enough to train such models are not available and are difficult to develop. We present a deep learning architecture, the Convolutional Sequence-to-Sequence model, that moves towards an end-to-end trainable OMR pipeline and applies a learning process that trains on full sentences of sheet music instead of individually labeled symbols. The model is trained and evaluated on a human-generated data set, with various image augmentations based on real-world scenarios. This data set is the first publicly available set in OMR research of sufficient size to train and evaluate deep learning models. With the introduced augmentations, a pitch recognition accuracy of 81% and a duration accuracy of 94% are achieved, resulting in a note-level accuracy of 80%. Finally, the model is compared to commercially available methods, showing a large improvement over these applications.
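
The relation between the three reported accuracies can be illustrated with a minimal sketch. It assumes that a note counts as correct only when both its pitch and its duration are correct, and that predicted and target sequences are already aligned one-to-one; the abstract states neither, and the function name `accuracies` and the toy notes are hypothetical.

```python
from typing import List, Tuple

# A note is represented here as a (pitch, duration) pair, e.g. ("C4", "quarter").
# This representation is an assumption made for illustration only.
Note = Tuple[str, str]


def accuracies(predicted: List[Note], target: List[Note]) -> Tuple[float, float, float]:
    """Return (pitch accuracy, duration accuracy, note-level accuracy).

    Assumes the two sequences are aligned and of equal, non-zero length.
    """
    if not target or len(predicted) != len(target):
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(target)
    pitch_hits = sum(p[0] == t[0] for p, t in zip(predicted, target))
    duration_hits = sum(p[1] == t[1] for p, t in zip(predicted, target))
    # A note is counted as correct only if pitch and duration both match.
    note_hits = sum(p == t for p, t in zip(predicted, target))
    return pitch_hits / n, duration_hits / n, note_hits / n


# Toy data: one pitch error (second note) and one duration error (third note).
target = [("C4", "quarter"), ("E4", "quarter"), ("G4", "half"), ("C5", "whole")]
predicted = [("C4", "quarter"), ("D4", "quarter"), ("G4", "quarter"), ("C5", "whole")]

pitch_acc, duration_acc, note_acc = accuracies(predicted, target)
print(pitch_acc, duration_acc, note_acc)
```

Under this definition the note-level accuracy can never exceed the lower of the two component accuracies, and it is at least their sum minus one. For a pitch accuracy of 81% and a duration accuracy of 94%, that gives a range of 75% to 81%, which is consistent with the reported 80%.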
