Paper Information - Optical Music Recognition with Convolutional Sequence-to-Sequence Models

Optical Music Recognition (OMR) is an important technology within Music Information Retrieval. Deep learning models show promising results on OMR tasks, but symbol-level annotated data sets of sufficient size to train such models are not available and are difficult to develop. We present a deep learning architecture called a Convolutional Sequence-to-Sequence model, both to move towards an end-to-end trainable OMR pipeline and to apply a learning process that trains on full sentences of sheet music instead of individually labeled symbols. The model is trained and evaluated on a human-generated data set, with various image augmentations based on real-world scenarios. This data set is the first publicly available set in OMR research with sufficient size to train and evaluate deep learning models. With the introduced augmentations, the model achieves a pitch recognition accuracy of 81% and a duration accuracy of 94%, resulting in a note-level accuracy of 80%. Finally, the model is compared to commercially available methods, showing a large improvement over these applications.
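The three accuracy figures in the abstract can be related with a small sketch. It assumes, as the abstract's wording suggests but does not state outright, that a note counts as correct only when both its pitch and its duration are correct, and that predictions are already aligned one-to-one with the ground truth. The `(pitch, duration)` label format, the function name `accuracies`, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical label format: (pitch, duration). Not the paper's encoding.
Note = Tuple[str, str]


def accuracies(pred: List[Note], target: List[Note]) -> Tuple[float, float, float]:
    """Return (pitch, duration, note) accuracy over aligned note sequences."""
    if len(pred) != len(target) or not target:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(target)
    pitch_ok = sum(p[0] == t[0] for p, t in zip(pred, target))
    duration_ok = sum(p[1] == t[1] for p, t in zip(pred, target))
    # Assumption: a note is correct only if pitch AND duration both match.
    note_ok = sum(p == t for p, t in zip(pred, target))
    return pitch_ok / n, duration_ok / n, note_ok / n


# Toy data, invented for illustration only.
target = [("C4", "quarter"), ("D4", "quarter"), ("E4", "half"), ("G4", "eighth")]
pred = [("C4", "quarter"), ("D4", "half"), ("F4", "half"), ("G4", "eighth")]

pitch_acc, duration_acc, note_acc = accuracies(pred, target)
print(pitch_acc, duration_acc, note_acc)
```

Under that assumption, note accuracy can never exceed the smaller of the two component accuracies and can never fall below their sum minus one. For the reported figures this gives a range of 75% to 81%, and the reported 80% sits inside it.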

Karen Ullrich | Eelco van der Wel
