Paper Information - CNN Based Two-stage Multi-resolution End-to-end Model for Singing Melody Extraction

Inspired by human auditory perception, this paper proposes a two-stage, multi-resolution, end-to-end model for singing melody extraction. A convolutional neural network (CNN) forms the core of the proposed model and generates the multi-resolution representations. Multi-resolution analysis is carried out in two successive stages: first on the waveform, using 1-D CNN kernels of different lengths, and then on the resulting spectrogram-like graphs, using 2-D CNN kernels of different sizes. The 1-D CNNs with kernels of different lengths produce multi-resolution spectrogram-like graphs without suffering from the trade-off between spectral and temporal resolution. The 2-D CNNs with kernels of different sizes extract features from spectro-temporal envelopes at different scales. Experimental results show that the proposed model outperforms three compared systems on three out of five public databases.
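The first stage described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the learned 1-D kernels are replaced by fixed Hann-windowed cosines, and every name and value here (`windowed_sinusoid`, `conv1d_same`, `spectrogram_like`, the 8 kHz sample rate, the kernel lengths 32/128/512, the stride of 64) is an assumption chosen only for illustration. The point it shows is that banks of kernels with different lengths, applied with the same stride and zero padding, yield spectrogram-like graphs with identical frame counts, so several resolutions can coexist side by side instead of being fixed by a single analysis window.

```python
import math


def windowed_sinusoid(length, freq_hz, sample_rate):
    """Hann-windowed cosine: a fixed stand-in for one learned 1-D kernel."""
    return [
        0.5 * (1 - math.cos(2 * math.pi * n / (length - 1)))
        * math.cos(2 * math.pi * freq_hz * n / sample_rate)
        for n in range(length)
    ]


def conv1d_same(signal, kernel, stride):
    """Strided 1-D correlation with zero padding.

    The padding makes the number of output frames depend only on the
    signal length and the stride, not on the kernel length.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * (len(kernel) - 1 - half)
    return [
        sum(k * padded[start + i] for i, k in enumerate(kernel))
        for start in range(0, len(signal), stride)
    ]


def spectrogram_like(signal, kernel_length, freqs_hz, sample_rate, stride):
    """One resolution: a bank of equal-length kernels followed by rectification.

    Returns a list of rows, one per kernel, each with one value per frame.
    """
    return [
        [
            abs(v)
            for v in conv1d_same(
                signal, windowed_sinusoid(kernel_length, f, sample_rate), stride
            )
        ]
        for f in freqs_hz
    ]


# Hypothetical input: a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(2048)]
freqs_hz = [220, 440, 880]

# Short kernels favour temporal detail, long kernels favour spectral detail.
graphs = {
    length: spectrogram_like(signal, length, freqs_hz, sample_rate, stride=64)
    for length in (32, 128, 512)
}

for length, graph in graphs.items():
    print(length, len(graph), len(graph[0]))
```

In the model itself the kernels are learned from data, and the stacked graphs are then passed to 2-D CNNs with kernels of different sizes; that second stage is not sketched here.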
