CNN Based Two-stage Multi-resolution End-to-end Model for Singing Melody Extraction

Inspired by human hearing perception, we propose a two-stage multi-resolution end-to-end model for singing melody extraction. Convolutional neural networks (CNNs) form the core of the proposed model and generate its multi-resolution representations. Multi-resolution analysis is carried out successively in 1-D on the waveform and in 2-D on the resulting spectrogram-like graphs, using 1-D CNN kernels of different lengths and 2-D CNN kernels of different sizes. The 1-D CNNs with kernels of different lengths produce multi-resolution spectrogram-like graphs without suffering from the trade-off between spectral and temporal resolution. The 2-D CNNs with kernels of different sizes extract features from spectro-temporal envelopes at different scales. Experimental results show that the proposed model outperforms three compared systems on three out of five public databases.
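As a rough illustration of the first stage only, the sketch below applies banks of 1-D kernels of different lengths to a raw waveform with a shared hop size, so that each bank yields a spectrogram-like map on the same time grid. It is not the authors' architecture: the kernel lengths (64, 256, 1024), the number of filters, the hop size, the log compression, and the function names `conv1d_bank` and `multi_resolution_maps` are all assumptions made for illustration, and the kernels are random rather than learned.

```python
import numpy as np


def conv1d_bank(waveform, kernels, hop):
    """Strided 1-D convolution of a waveform with a bank of kernels.

    waveform: array of shape (n_samples,)
    kernels:  array of shape (n_filters, kernel_len)
    returns:  array of shape (n_filters, n_frames), a spectrogram-like map
    """
    n_filters, kernel_len = kernels.shape
    # Zero-pad both ends so every kernel length yields the same frame count.
    pad = kernel_len // 2
    padded = np.pad(waveform, (pad, pad))
    n_frames = 1 + (len(waveform) - 1) // hop
    frames = np.stack(
        [padded[i * hop : i * hop + kernel_len] for i in range(n_frames)]
    )
    # (n_frames, kernel_len) @ (kernel_len, n_filters) -> (n_frames, n_filters)
    response = frames @ kernels.T
    # Log-compressed magnitude (an illustrative choice of nonlinearity).
    return np.log1p(np.abs(response)).T


def multi_resolution_maps(waveform, kernel_lengths=(64, 256, 1024),
                          n_filters=32, hop=160, seed=0):
    """Stack one spectrogram-like map per kernel length.

    Short kernels favour temporal resolution, long kernels favour
    spectral resolution; all maps share the same hop and frame count.
    Kernels are random here; in a trained model they would be learned.
    """
    rng = np.random.default_rng(seed)
    maps = []
    for length in kernel_lengths:
        kernels = rng.standard_normal((n_filters, length)) / np.sqrt(length)
        maps.append(conv1d_bank(waveform, kernels, hop))
    return np.stack(maps)  # (n_resolutions, n_filters, n_frames)


# Usage: one second of a 220 Hz tone sampled at 16 kHz (hypothetical input).
sr = 16000
t = np.arange(sr) / sr
waveform = np.sin(2 * np.pi * 220.0 * t)
maps = multi_resolution_maps(waveform)
print(maps.shape)  # (3, 32, 100)
```

Because every bank uses the same hop, the three maps align frame by frame and can be stacked as channels for a later 2-D stage, which is one simple way to sidestep choosing a single analysis window length.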
