On sparse and low-rank matrix decomposition for singing voice separation

In recent years there has been growing interest in transforming signals and matrices into sparse or low-rank representations, i.e., representations that are sparse in support or of low redundancy. Such decompositions are proving particularly powerful for a variety of signal processing and compression problems. In this paper, we investigate the application of sparse and low-rank matrix decomposition to the challenging task of separating the singing voice from the accompaniment in popular music. The vocal part is modeled as a sparse signal, whereas the instrumental part is modeled as low-rank. In addition, to better account for the particular properties of music, two new algorithms are proposed to improve the decomposition: one incorporates harmonicity priors, and the other applies a back-end drum removal procedure. Evaluations on the MIR-1K benchmark dataset show that the proposed algorithms outperform the state of the art by 0.01 to 2.41 dB.
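The sparse-plus-low-rank model described above can be illustrated with a minimal sketch. The abstract does not specify a solver, so the code below assumes the standard robust PCA (principal component pursuit) formulation, minimizing the nuclear norm of L plus lam times the l1 norm of S subject to L + S = M, solved with an inexact augmented Lagrange multiplier iteration. The names `rpca` and `soft_threshold` and the parameters `lam`, `rho`, `tol`, and `max_iter` are illustrative choices, not taken from the paper, and the harmonicity priors and drum removal step are not modeled here.

```python
import numpy as np


def soft_threshold(x, tau):
    """Elementwise shrinkage: move each entry toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def rpca(M, lam=None, rho=1.5, tol=1e-7, max_iter=500):
    """Split M into low-rank L and sparse S with L + S ~= M.

    Assumed setting: M is a real, nonzero matrix such as a magnitude
    spectrogram (frequency bins x time frames). Under the model in the
    abstract, L would capture the accompaniment and S the singing voice.
    """
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    if lam is None:
        # Common default weight for the sparse term (an assumption here).
        lam = 1.0 / np.sqrt(max(m, n))

    norm_M = np.linalg.norm(M, "fro")
    spectral = np.linalg.norm(M, 2)

    # Dual variable, scaled so that its entries are bounded by lam.
    Y = M / max(spectral, np.abs(M).max() / lam)
    mu = 1.25 / spectral  # penalty weight, increased every iteration
    L = np.zeros_like(M)
    S = np.zeros_like(M)

    for _ in range(max_iter):
        # Low-rank update: shrink the singular values by 1/mu.
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * soft_threshold(sig, 1.0 / mu)) @ Vt

        # Sparse update: shrink the entries by lam/mu.
        S = soft_threshold(M - L + Y / mu, lam / mu)

        # Dual ascent on the constraint L + S = M.
        R = M - L - S
        Y = Y + mu * R
        mu *= rho

        if np.linalg.norm(R, "fro") <= tol * norm_M:
            break

    return L, S


if __name__ == "__main__":
    # Synthetic stand-in for a spectrogram: rank-2 background plus sparse spikes.
    rng = np.random.default_rng(0)
    background = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 40))
    spikes = np.zeros((60, 40))
    idx = rng.random((60, 40)) < 0.05
    spikes[idx] = rng.uniform(-10.0, 10.0, size=int(idx.sum()))

    L, S = rpca(background + spikes)
    print("rank of L:", np.linalg.matrix_rank(L, tol=1e-6))
    print("nonzeros in S:", int(np.count_nonzero(S)))
```

In a separation setting, M would typically be the magnitude of a short-time Fourier transform of the mixture, and the two estimated parts would be turned back into audio by reusing the mixture phase. Those steps, and the choice of `lam`, are left out because the abstract does not describe them.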
