Stereophonic music separation based on non-negative tensor factorization with cepstrum regularization

This paper presents a novel approach to stereophonic music separation based on non-negative tensor factorization (NTF). Stereophonic music can be roughly divided into two types, recorded music and synthesized music; in this paper we focus on the latter. Synthesized music signals are often generated as linear combinations of many individual source signals, each scaled by time-invariant mixing gains for the two channel signals. Separating synthesized stereophonic music is therefore an underdetermined source separation problem in which phase components provide no useful cues. NTF is an effective technique for this problem: it decomposes the amplitude spectrograms of the stereo channel signals into basis vectors and activations of the individual source signals together with their corresponding mixing gains. However, it is inherently difficult to obtain sufficient separation performance because the acoustic cues available for separation are limited. To address this issue, we propose a cepstrum regularization method for NTF-based stereo channel separation. The proposed method encourages the separated source signals to follow Gaussian mixture models of the individual sources, which are trained in advance on available samples of those sources. An experimental evaluation using real music signals is conducted to investigate the effectiveness of the proposed method in both supervised and unsupervised separation frameworks. The experimental results demonstrate that the proposed method yields significant improvements in separation performance in both frameworks.
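As a rough illustration of the decomposition described above, the sketch below factorizes a nonnegative tensor of stereo amplitude spectrograms into spectral basis vectors, temporal activations, and per-channel mixing gains using multiplicative updates for a squared Euclidean objective. The tensor layout, variable names (Q, W, H), and the choice of objective are illustrative assumptions, not the paper's exact formulation, and the cepstrum regularizer is omitted here.

```python
# Minimal NTF sketch (assumed Euclidean objective, illustrative names):
# X[c, f, t] ~= sum_k Q[c, k] * W[f, k] * H[k, t]
import numpy as np

def ntf_separate(X, n_bases=20, n_iter=200, eps=1e-12, seed=0):
    """X: nonnegative array of shape (channels, freq_bins, frames)."""
    rng = np.random.default_rng(seed)
    C, F, T = X.shape
    Q = rng.random((C, n_bases)) + eps   # per-channel mixing gains
    W = rng.random((F, n_bases)) + eps   # spectral basis vectors
    H = rng.random((n_bases, T)) + eps   # temporal activations

    def model():
        # X_hat[c, f, t] = sum_k Q[c, k] W[f, k] H[k, t]
        return np.einsum('ck,fk,kt->cft', Q, W, H)

    for _ in range(n_iter):
        X_hat = model()
        W *= np.einsum('cft,ck,kt->fk', X, Q, H) / (np.einsum('cft,ck,kt->fk', X_hat, Q, H) + eps)
        X_hat = model()
        H *= np.einsum('cft,ck,fk->kt', X, Q, W) / (np.einsum('cft,ck,fk->kt', X_hat, Q, W) + eps)
        X_hat = model()
        Q *= np.einsum('cft,fk,kt->ck', X, W, H) / (np.einsum('cft,fk,kt->ck', X_hat, W, H) + eps)
    return Q, W, H

def wiener_masks(X, Q, W, H, eps=1e-12):
    """Recover each basis component's contribution to every channel by
    Wiener-like masking of the observed amplitude spectrograms."""
    parts = np.einsum('ck,fk,kt->ckft', Q, W, H)            # (C, K, F, T)
    return X[:, None] * parts / (parts.sum(axis=1, keepdims=True) + eps)
```

In practice the basis indices k would be grouped per source (e.g., via supervised pre-training of W), and the masked spectrograms would be combined with the mixture phase for resynthesis.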

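The cepstrum regularization itself can be pictured as a penalty that rewards separated sources whose cepstra are likely under source-specific Gaussian mixture models trained beforehand. The sketch below only illustrates that idea: the DCT-based cepstrum, the use of scikit-learn's GaussianMixture, and all function names are assumptions of this sketch, and the paper presumably folds the regularizer directly into the NTF objective rather than evaluating it as a standalone score.

```python
# Hedged sketch of the cepstrum-regularization idea: score how well the
# cepstra of a separated source's spectra fit a GMM trained on clean samples.
import numpy as np
from scipy.fft import dct
from sklearn.mixture import GaussianMixture

def spectra_to_cepstra(amp_spec, n_ceps=20, eps=1e-12):
    """amp_spec: (freq_bins, frames) amplitude spectrogram of one source."""
    log_spec = np.log(amp_spec + eps)
    # DCT along the frequency axis gives a real cepstral representation.
    ceps = dct(log_spec, type=2, norm='ortho', axis=0)
    return ceps[:n_ceps].T                      # (frames, n_ceps)

def train_source_gmm(clean_amp_spec, n_components=8):
    """Fit a GMM to the cepstra of clean samples of one source (offline)."""
    return GaussianMixture(n_components=n_components).fit(
        spectra_to_cepstra(clean_amp_spec))

def cepstrum_penalty(separated_amp_spec, gmm):
    """Negative average log-likelihood: small when the separated source
    resembles the pre-trained source model."""
    return -gmm.score(spectra_to_cepstra(separated_amp_spec))
```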