A two-stage singing voice separation algorithm using spectro-temporal modulation features

A two-stage singing voice separation algorithm using spectro-temporal modulation features is proposed in this paper. First, music clips are transformed into auditory spectrograms, and the spectro-temporal modulation content of every time-frequency (T-F) unit of the auditory spectrograms is extracted with an auditory model. The T-F units are then sequentially clustered into percussive, harmonic, and vocal units by the proposed two-stage algorithm using the expectation-maximization (EM) algorithm. Finally, the singing voice is synthesized from the clustered vocal T-F units via time-frequency masking. The algorithm was evaluated on the MIR-1K dataset and achieved better separation results than our previously proposed one-stage algorithm.
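For illustration only, the pipeline can be sketched as below. This is a minimal stand-in, not the paper's implementation: the auditory-model spectro-temporal modulation features are replaced by simple log-magnitude and local modulation proxies, the input file name mixture.wav is hypothetical, and the final vocal-cluster choice is arbitrary. It only shows the overall flow: spectrogram analysis, two stages of EM (Gaussian mixture) clustering of T-F units, and binary time-frequency masking.

```python
# Minimal sketch of the two-stage T-F clustering idea (assumptions: a monaural
# clip "mixture.wav"; hand-crafted modulation proxies instead of the auditory model).
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft, istft
from sklearn.mixture import GaussianMixture  # EM clustering of T-F units

fs, x = wavfile.read("mixture.wav")          # hypothetical input mixture
x = x.astype(np.float64)
if x.ndim > 1:
    x = x.mean(axis=1)                       # down-mix to mono

f, t, Z = stft(x, fs=fs, nperseg=1024)
logmag = np.log1p(np.abs(Z))

# Stand-in "modulation" features: local derivatives along time and frequency.
dt = np.abs(np.diff(logmag, axis=1, prepend=logmag[:, :1]))  # temporal modulation proxy
df = np.abs(np.diff(logmag, axis=0, prepend=logmag[:1, :]))  # spectral modulation proxy
feats = np.stack([logmag.ravel(), dt.ravel(), df.ravel()], axis=1)

# Stage 1: EM clustering splits off percussive-like units (high temporal modulation)
# from the remaining harmonic + vocal units.
gmm1 = GaussianMixture(n_components=2, random_state=0).fit(feats)
labels1 = gmm1.predict(feats).reshape(logmag.shape)
percussive = np.argmax([dt[labels1 == k].mean() for k in range(2)])
remaining = labels1 != percussive

# Stage 2: re-cluster the remaining units into harmonic and vocal groups.
feats2 = feats[remaining.ravel()]
gmm2 = GaussianMixture(n_components=2, random_state=0).fit(feats2)
labels2 = np.full(logmag.size, -1)
labels2[remaining.ravel()] = gmm2.predict(feats2)
labels2 = labels2.reshape(logmag.shape)
vocal_mask = (labels2 == 0).astype(float)    # arbitrary choice in this sketch

# Synthesize the singing voice by binary time-frequency masking.
_, voice = istft(Z * vocal_mask, fs=fs, nperseg=1024)
wavfile.write("voice_estimate.wav", fs,
              (voice / (np.abs(voice).max() + 1e-9)).astype(np.float32))
```

In the full algorithm, the rate-scale modulation features produced by the auditory model would drive both clustering stages and determine which cluster is percussive, harmonic, or vocal; the hand-crafted proxies above only mimic that idea.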
