Main Melody Estimation with Source-Filter NMF and CRNN

Estimating the main melody of a polyphonic audio recording remains a challenging task. We approach the problem from a classification perspective and adopt a convolutional recurrent neural network (CRNN) architecture that relies on a particular form of pretraining based on source-filter nonnegative matrix factorisation (NMF). The source-filter NMF decomposition is chosen for its ability to capture the pitch and timbre content of the leading voice or instrument, providing a better initial pitch salience than standard time-frequency representations. Starting from this musically motivated representation, we propose to further enhance the NMF-based salience representation with convolutional layers, to model its temporal structure with a recurrent network, and to estimate the dominant melody with a final classification layer. The results show that such a system achieves state-of-the-art performance on the MedleyDB dataset without any data augmentation or large training sets.
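To illustrate the NMF pretraining step at the core of this pipeline, the following is a minimal sketch of plain NMF with multiplicative updates on a toy magnitude spectrogram. It is a simplification, not the full source-filter model of the paper: the source-filter variant additionally constrains the dictionary into pitch (source) and timbre (filter) components, whereas this sketch factorises V ≈ WH with unconstrained nonnegative factors. All function and variable names here are illustrative.

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-10, seed=0):
    """Plain NMF via multiplicative updates (Euclidean cost).

    Factorises a nonnegative matrix V (freq x time) into
    W (freq x rank, spectral templates) and H (rank x time, activations),
    so that V is approximated by W @ H. Multiplicative updates keep both
    factors nonnegative by construction.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        # Standard Lee-Seung multiplicative update rules.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram": two spectral patterns active at different times,
# so an exact rank-2 factorisation exists.
V = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])

W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H)
```

In the paper's setting, the activations of the pitch-related NMF component play the role of an initial pitch-salience representation, which the CNN layers then refine before the recurrent and classification stages.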
