Mining Labeled Data from Web-Scale Collections for Vocal Activity Detection in Music

This work demonstrates an approach to generating strongly labeled data for vocal activity detection by pairing instrumental versions of songs with their original mixes. Though such pairs are rare, we find ample instances in a massive music collection, enough to train deep convolutional networks on this task and achieve state-of-the-art performance with a fraction of the human effort previously required. Our error analysis reveals two notable insights: imperfect systems may exhibit better temporal precision than human annotators, and could therefore be used to accelerate annotation; and machine learning from mined data can reveal subtle biases in the data source, leading to a better understanding of the problem itself. We also discuss future directions for the design and evolution of benchmarking datasets to rigorously evaluate AI systems.
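The core labeling idea, inferring vocal activity from the difference between a mix and its instrumental version, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the two recordings are already time-aligned and gain-matched, and the function name, frame length, and energy threshold are all hypothetical choices.

```python
import numpy as np

def vocal_activity_labels(mix, instrumental, frame_len=1024, threshold_db=-25.0):
    """Label each frame 1 (vocal) or 0 (non-vocal) by comparing the energy of
    the residual (mix minus instrumental) to the energy of the mix.
    Assumes aligned, gain-matched inputs; parameters are illustrative."""
    n = (min(len(mix), len(instrumental)) // frame_len) * frame_len
    residual = (mix[:n] - instrumental[:n]).reshape(-1, frame_len)
    mix_frames = mix[:n].reshape(-1, frame_len)
    res_energy = np.sum(residual ** 2, axis=1)
    mix_energy = np.sum(mix_frames ** 2, axis=1) + 1e-12
    ratio_db = 10.0 * np.log10(res_energy / mix_energy + 1e-12)
    return (ratio_db > threshold_db).astype(int)

# Synthetic demo: the "instrumental" is a 220 Hz tone; a "vocal" tone is
# added only during the middle second of a three-second signal.
sr = 8000
t = np.arange(sr * 3) / sr
instrumental = 0.5 * np.sin(2 * np.pi * 220 * t)
vocals = np.zeros_like(instrumental)
vocals[sr:2 * sr] = 0.3 * np.sin(2 * np.pi * 440 * t[sr:2 * sr])
mix = instrumental + vocals

labels = vocal_activity_labels(mix, instrumental)
# Frames in the middle of the signal are labeled vocal; the rest are not.
```

In practice the residual would be computed on aligned spectrogram frames rather than raw samples, and the threshold would need tuning, but the principle is the same: wherever the mix departs audibly from the instrumental, the difference is (mostly) the voice.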
