Deep Learning Models for Melody Perception: An Investigation on Symbolic Music Data

We investigate deep learning approaches to the melody extraction problem on symbolic music data. Specifically, we compare two approaches: the first employs recurrent neural networks (RNN), treating melody extraction as a sequence prediction problem, while the second employs fully convolutional networks (FCN), treating it as an image semantic segmentation problem. Both methods are tested on a MIDI dataset in which the melody tracks serve as ground truth. We also consider a more challenging case in which the melodies are shifted by one octave. Evaluation results show that the semantic segmentation approach achieves higher accuracy.
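To make the segmentation framing concrete, the following is a minimal sketch of how symbolic notes might be rendered as a binary piano-roll "image" with a matching melody mask as the per-pixel target. The note tuple layout, the function names `piano_roll` and `shift_melody_octave`, the time-step grid, and the downward direction of the octave shift are all assumptions made for illustration; the abstract does not specify the actual input representation.

```python
import numpy as np

def piano_roll(notes, n_steps, n_pitches=128):
    """Render notes as a binary piano roll plus a melody mask.

    notes: iterable of (pitch, start_step, end_step, is_melody),
    a hypothetical layout chosen for this sketch.
    """
    roll = np.zeros((n_pitches, n_steps), dtype=np.uint8)  # network input
    mask = np.zeros((n_pitches, n_steps), dtype=np.uint8)  # segmentation target
    for pitch, start, end, is_melody in notes:
        roll[pitch, start:end] = 1
        if is_melody:
            mask[pitch, start:end] = 1
    return roll, mask

def shift_melody_octave(notes, semitones=-12):
    """Transpose only the melody notes (assumed here: one octave down)."""
    return [(p + semitones if m else p, s, e, m) for p, s, e, m in notes]

# Toy example: two melody notes over two sustained accompaniment notes.
notes = [(72, 0, 4, True), (74, 4, 8, True),
         (48, 0, 8, False), (55, 0, 8, False)]
roll, mask = piano_roll(notes, n_steps=8)
shifted_roll, shifted_mask = piano_roll(shift_melody_octave(notes), n_steps=8)
```

In this framing a segmentation network would take `roll` as input and predict `mask` pixel by pixel, whereas a sequence model would read the same data one time step (column) at a time.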
