Singing Voice Separation from Monaural Recordings

Separating singing voice from music accompaniment has applications in automatic lyrics recognition and alignment, singer identification, and music information retrieval. Compared with the extensive work on speech separation, singing voice separation has been little explored. We propose a system that separates singing voice from music accompaniment in monaural recordings. The system has three stages. The singing voice detection stage partitions the input and classifies each portion as vocal or non-vocal. For the vocal portions, the predominant pitch detection stage then detects the pitch contour of the singing voice. Finally, the separation stage uses the detected pitch contour to group the time-frequency segments that belong to the singing voice. Quantitative results show that the system performs well in singing voice separation.
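The role of the pitch contour in the final stage can be illustrated with a small Python sketch. It is not the system's actual grouping procedure, which the abstract does not specify. It assumes the first two stages have already produced one pitch value per frame (`None` for non-vocal frames) and that the mixture is available as a magnitude spectrogram with linearly spaced frequency bins. The function names `harmonic_mask` and `apply_mask`, and the parameters `bin_hz` and `tol_hz`, are hypothetical and chosen only for illustration.

```python
from typing import List, Optional


def harmonic_mask(
    pitch_hz: List[Optional[float]],
    n_bins: int,
    bin_hz: float,
    tol_hz: float,
) -> List[List[int]]:
    """Build a binary time-frequency mask from a per-frame pitch contour.

    A bin is set to 1 when its centre frequency lies within tol_hz of a
    harmonic of the frame's pitch. Non-vocal frames (pitch None) stay 0.
    Assumption: bin k has centre frequency k * bin_hz.
    """
    mask = []
    for f0 in pitch_hz:
        row = [0] * n_bins
        if f0 is not None and f0 > 0:
            for k in range(n_bins):
                freq = k * bin_hz
                harmonic = round(freq / f0)  # nearest harmonic number
                if harmonic >= 1 and abs(freq - harmonic * f0) <= tol_hz:
                    row[k] = 1
        mask.append(row)
    return mask


def apply_mask(
    spec: List[List[float]], mask: List[List[int]]
) -> List[List[float]]:
    """Keep only the spectrogram bins that the mask assigns to the voice."""
    return [
        [m * s for m, s in zip(mask_row, spec_row)]
        for mask_row, spec_row in zip(mask, spec)
    ]


# Toy input: frame 0 is non-vocal, frame 1 is vocal with a 200 Hz pitch.
pitch = [None, 200.0]
mask = harmonic_mask(pitch, n_bins=9, bin_hz=100.0, tol_hz=10.0)
spec = [[1.0] * 9, [2.0] * 9]
voice = apply_mask(spec, mask)
```

In this toy case the non-vocal frame is masked out entirely, and the vocal frame keeps only the bins at 200, 400, 600, and 800 Hz. A real system would work on perceptually motivated time-frequency units and would have to cope with pitch errors and accompaniment harmonics that overlap the voice, none of which this sketch handles.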
