Single-Channel Blind Source Separation for Singing Voice Detection: A Comparative Study

We propose a novel unsupervised singing voice detection method that uses a single-channel Blind Audio Source Separation (BASS) algorithm as a preliminary step. To this end, we investigate three promising BASS approaches that operate through morphological filtering of the spectrogram of the analyzed mixture. The contributions of this paper are manifold. First, the investigated BASS methods are reformulated within a common formalism, and we study their respective hyperparameters through numerical simulations. Second, we propose an extension of the Kernel Additive Modelling (KAM) method, together with a novel training algorithm that computes a source-specific kernel from a given isolated source signal. Third, the BASS methods are compared in terms of source separation accuracy and of singing voice detection accuracy when they are used in our new singing voice detection framework. Finally, we carry out an exhaustive singing voice detection evaluation in which we compare supervised and unsupervised methods. This comparison explores different combinations of the proposed BASS methods with new features, such as the newly proposed KAM features and the scattering transform, within a machine learning framework, and also considers convolutional neural network methods.
