Generalized Metrics for Single-f0 Estimation Evaluation

Single-f0 estimation methods, including pitch trackers and melody estimators, have historically been evaluated with a common set of metrics that score estimates frame-wise in terms of pitch and voicing accuracy. “Voicing” refers to whether or not a pitch is active, and has traditionally been treated as a binary value. This is limiting because it is often ambiguous whether a pitch is present or absent, making a binary choice difficult for humans and algorithms alike: when a source fades out or reverberates, for example, the exact point at which the pitch is no longer present is unclear. Many single-f0 estimation algorithms apply a threshold to decide whether a pitch is active, and different choices of threshold drastically affect the results of the standard metrics. In this paper, we present a refinement of the existing single-f0 metrics that allows the estimated voicing to be represented as a continuous likelihood and introduces a weighting on frame-level pitch accuracy based on the energy of the source producing the f0 relative to the energy of the rest of the signal. We compare the proposed metrics experimentally with the previous ones across a number of algorithms and datasets and discuss the fundamental differences. We show that, unlike the previous metrics, the proposed metrics allow threshold-independent algorithm comparisons.

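As a rough illustration of the idea (a minimal sketch, not the paper's exact formulation), the code below shows how frame-wise voicing and pitch scores might be generalized when the estimated voicing is a likelihood in [0, 1] rather than a binary flag, and how an optional per-frame weight (e.g. derived from the energy of the target source relative to the rest of the mix) could be folded into pitch accuracy. The function name, the 50-cent tolerance default, and the exact weighting scheme are illustrative assumptions.

```python
import numpy as np

def generalized_frame_metrics(ref_voicing, ref_cent, est_voicing, est_cent,
                              cent_tolerance=50.0, frame_weights=None):
    """Frame-wise pitch/voicing scores with continuous estimated voicing.

    ref_voicing   : 0/1 reference voicing per frame
    ref_cent,
    est_cent      : reference / estimated f0 per frame, in cents
    est_voicing   : estimated voicing likelihood in [0, 1] per frame
    frame_weights : optional per-frame weights, e.g. based on the energy of
                    the source producing the f0 relative to the rest of the
                    signal (defaults to uniform weights)
    """
    ref_voicing = np.asarray(ref_voicing, dtype=float)
    est_voicing = np.asarray(est_voicing, dtype=float)
    ref_cent = np.asarray(ref_cent, dtype=float)
    est_cent = np.asarray(est_cent, dtype=float)
    if frame_weights is None:
        frame_weights = np.ones_like(ref_voicing)
    frame_weights = np.asarray(frame_weights, dtype=float)

    voiced = ref_voicing > 0
    unvoiced = ~voiced

    # A frame's pitch counts as correct if it is within the tolerance
    # (in cents) of the reference pitch.
    correct = (np.abs(ref_cent - est_cent) <= cent_tolerance).astype(float)

    # Pitch accuracy over reference-voiced frames: each frame is weighted by
    # the estimator's voicing likelihood and by the (energy-based) frame weight.
    w = frame_weights[voiced]
    raw_pitch_accuracy = (np.sum(w * est_voicing[voiced] * correct[voiced])
                          / max(np.sum(w), 1e-8))

    # Voicing recall / false alarm generalized to continuous likelihoods:
    # sum the likelihoods instead of counting hard voiced/unvoiced decisions.
    voicing_recall = np.sum(est_voicing[voiced]) / max(np.sum(voiced), 1)
    voicing_false_alarm = np.sum(est_voicing[unvoiced]) / max(np.sum(unvoiced), 1)

    return raw_pitch_accuracy, voicing_recall, voicing_false_alarm
```

Note that when est_voicing is restricted to hard 0/1 values and the frame weights are uniform, these expressions reduce to the standard binary definitions, so the generalized scores remain directly comparable to the conventional ones.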