Singing Melody Extraction from Polyphonic Music Based on Spectral Correlation Modeling

Convolutional neural network (CNN) based methods have achieved state-of-the-art performance for singing melody extraction from polyphonic music. However, most of these methods focus on learning local features, while relationships among spectral components located far apart are often neglected. In this paper, we explore the idea of explicitly modeling spectral correlation for melody extraction. Specifically, we present a spectral correlation module (SCM) that learns to model the relationships among all frequency bands in a time-frequency representation, thus allowing global spectral information to be encoded into a conventional CNN. Furthermore, we propose to integrate center frequencies with the input feature map of the SCM to improve performance. We implement a lightweight model composed of SCM blocks to verify the efficacy of our approach. Our system achieves a state-of-the-art overall accuracy of 83.5% on the MedleyDB dataset.
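To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of one plausible reading of the spectral correlation module: attention computed across all frequency bins of a time-frequency feature map, with the per-bin center frequencies appended as an extra input channel before the projections. The layer sizes, the use of scaled dot-product attention, and the residual connection are assumptions for illustration only.

```python
# Hypothetical sketch of a spectral-correlation-style block; all design
# details below are assumptions, not the paper's actual SCM.
import torch
import torch.nn as nn


class SpectralCorrelationModule(nn.Module):
    def __init__(self, channels: int, attn_dim: int = 32):
        super().__init__()
        # +1 input channel holds the center-frequency map (assumption)
        self.query = nn.Conv2d(channels + 1, attn_dim, kernel_size=1)
        self.key = nn.Conv2d(channels + 1, attn_dim, kernel_size=1)
        self.value = nn.Conv2d(channels + 1, channels, kernel_size=1)
        self.scale = attn_dim ** -0.5

    def forward(self, x: torch.Tensor, center_freqs: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq_bins, time_frames)
        # center_freqs: (freq_bins,) center frequency of each bin, e.g. in Hz
        b, c, f, t = x.shape
        freq_map = center_freqs.view(1, 1, f, 1).expand(b, 1, f, t)
        h = torch.cat([x, freq_map], dim=1)

        # Project to queries/keys/values and attend along the frequency axis,
        # independently for every time frame.
        q = self.query(h).permute(0, 3, 2, 1)   # (b, t, f, attn_dim)
        k = self.key(h).permute(0, 3, 2, 1)     # (b, t, f, attn_dim)
        v = self.value(h).permute(0, 3, 2, 1)   # (b, t, f, channels)

        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)  # (b, t, f, f)
        out = (attn @ v).permute(0, 3, 2, 1)    # back to (b, channels, f, t)
        return x + out                          # residual connection (assumption)


if __name__ == "__main__":
    scm = SpectralCorrelationModule(channels=16)
    feats = torch.randn(2, 16, 360, 128)        # e.g. 360 frequency bins, 128 frames
    freqs = torch.linspace(32.7, 1975.5, 360)   # hypothetical bin center frequencies
    print(scm(feats, freqs).shape)              # torch.Size([2, 16, 360, 128])
```

Because the attention weights span every pair of frequency bins, such a block can relate harmonically linked partials that lie far apart in frequency, which is the kind of long-range spectral dependency a purely local CNN receptive field would miss.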