Structure-Aware Audio-to-Score Alignment Using Progressively Dilated Convolutional Neural Networks

Identifying structural differences between a music performance and its score is a challenging yet integral step of audio-to-score alignment, an important subtask of music information retrieval. We present a novel method that detects such differences for a given piece of music using progressively dilated convolutional neural networks. Our method uses varying dilation rates at different layers to capture both short-term and long-term context, and it remains effective when annotated data is limited. We conduct experiments on audio recordings of real performances that differ structurally from the score, and our results demonstrate that our models outperform standard methods for structure-aware audio-to-score alignment.
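To illustrate why varying the dilation rate across layers helps capture both short-term and long-term context, the following minimal sketch computes the receptive field of a stack of stride-1 dilated convolutions. The kernel size and the dilation schedules are illustrative assumptions, not the configuration used in the paper, and `receptive_field` is a hypothetical helper name.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in input frames, of stacked stride-1 dilated convolutions.

    Each layer with dilation d widens the receptive field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Assumed schedules for illustration only (not taken from the paper).
progressive = [1, 2, 4, 8]  # dilation grows with depth
uniform = [1, 1, 1, 1]      # ordinary convolutions, same depth

print(receptive_field(3, progressive))  # 31
print(receptive_field(3, uniform))      # 9
```

With the same number of layers and parameters, the progressively dilated stack covers a much wider span of frames, while its first layers still see fine-grained local context.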
