Don’t hide in the frames: Note- and pattern-based evaluation of automated melody extraction algorithms

In this paper, we address how to evaluate and improve the performance of automatic dominant-melody extraction systems from a pattern-mining perspective, with a focus on jazz improvisations. Traditionally, dominant-melody extraction systems estimate the melody at the frame level, but real-world musicological applications require note-level representations. For evaluating estimated note tracks, the current frame-wise metrics are not fully suitable and provide at most a first approximation. Furthermore, mining melodic patterns (n-grams) poses an additional challenge, because note-wise errors propagate geometrically with increasing pattern length. On the other hand, for certain derived metrics, such as pattern commonalities between performers, extraction errors might be less critical if at least qualitative rankings can be reproduced. Finally, when searching for similar patterns in a melody database, the number of irrelevant patterns in the result set grows as the similarity threshold is lowered; for reasons of usability, it is therefore important to understand this behavior for imperfect automated melody extractions. We propose three novel evaluation strategies for estimated note tracks, based on three application scenarios: pattern mining, pattern commonalities, and fuzzy pattern search. We apply the proposed metrics to one general state-of-the-art melody estimation method (Melodia) and to two variants of an algorithm optimized for the extraction of jazz solo melodies. A subset of the Weimar Jazz Database comprising 91 solos was used for evaluation. Results show that the optimized algorithm clearly outperforms the reference algorithm, whose performance quickly degrades and eventually breaks down for longer n-grams. Frame-wise metrics do provide an estimate of note-wise metrics, but only for sufficiently good extractions, whereas F1 scores for longer n-grams cannot be predicted from frame-wise F1 scores at all.
The ranking of pattern commonalities between performers can be reproduced with the optimized algorithms but not with the reference algorithm. Finally, the size of the result sets of pattern similarity searches shrinks for automated note extraction and for larger similarity thresholds, but the difference levels out for smaller thresholds.
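The geometric propagation of note-wise errors into pattern-wise errors can be illustrated with a minimal sketch (not from the paper). Assuming, for simplicity, that each extracted note is correct independently with probability p, a contiguous n-gram is reproduced exactly only when all n of its notes are correct, so expected pattern-level accuracy decays as p**n:

```python
def ngram_accuracy(note_accuracy: float, n: int) -> float:
    """Expected fraction of fully correct n-grams, assuming each note
    is extracted correctly with independent probability `note_accuracy`."""
    return note_accuracy ** n

# Even a strong note-level extraction degrades quickly for long patterns:
for p in (0.95, 0.80):
    decay = [round(ngram_accuracy(p, n), 3) for n in (1, 3, 5, 10)]
    print(f"note accuracy {p}: n-gram accuracies for n=1,3,5,10 -> {decay}")
```

The independence assumption is a simplification (real extraction errors cluster, e.g. around note onsets or octave jumps), but it captures why frame-wise or note-wise F1 scores alone cannot predict performance on longer n-grams.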
