Multi-Stage Non-Negative Matrix Factorization for Monaural Singing Voice Separation

Separating the singing voice from the music accompaniment is of interest for many applications, such as melody extraction, singer identification, lyrics alignment and recognition, and content-based music retrieval. This paper proposes a novel algorithm for singing voice separation in monaural mixtures. The algorithm consists of two stages, in which non-negative matrix factorization (NMF) is applied to decompose mixture spectrograms computed with long and short windows, respectively. A spectral discontinuity thresholding method is devised for the long-window NMF to select components originating from pitched instrumental sounds, and a temporal discontinuity thresholding method is designed for the short-window NMF to select components originating from percussive sounds. Eliminating the selected components filters most pitched and percussive elements of the accompaniment out of the input mixture, with little effect on the singing voice. Extensive testing on the public MIR-1K dataset of 1000 short audio clips and the Beach-Boys dataset of 14 full-track real-world songs showed that the proposed algorithm is both effective and efficient.
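
The component-selection idea can be illustrated with a minimal sketch in Python with NumPy. It is not the paper's implementation: the abstract does not specify the discontinuity measure, the thresholds, the NMF cost function, or the window lengths, so the `discontinuity` score (summed absolute first difference, normalised by the vector's sum), the Euclidean multiplicative updates, and every function and parameter name below are assumptions chosen for illustration. The sketch covers one stage only, operating on a given non-negative magnitude spectrogram `V`; the proposed algorithm would apply such a stage twice, scoring spectral bases for the long-window spectrogram and temporal activations for the short-window one.

```python
import numpy as np


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorise V ~= W @ H with Lee-Seung multiplicative updates (Euclidean cost).

    The cost function is an assumption; the abstract does not name one.
    """
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def discontinuity(x, eps=1e-9):
    """Hypothetical score: summed absolute first difference over the sum of x."""
    return np.abs(np.diff(x)).sum() / (x.sum() + eps)


def remove_discontinuous(V, rank, axis, threshold):
    """Drop NMF components whose score exceeds `threshold`; return the rest.

    axis="spectral" scores the columns of W (long-window stage, pitched sounds);
    axis="temporal" scores the rows of H (short-window stage, percussive sounds).
    """
    W, H = nmf(V, rank)
    if axis == "spectral":
        scores = np.array([discontinuity(W[:, k]) for k in range(rank)])
    elif axis == "temporal":
        scores = np.array([discontinuity(H[k]) for k in range(rank)])
    else:
        raise ValueError("axis must be 'spectral' or 'temporal'")
    keep = scores <= threshold
    # Magnitude spectrogram rebuilt from the retained components only.
    return W[:, keep] @ H[keep], scores
```

In a full pipeline the retained magnitude would be combined with the mixture phase and inverted back to a waveform before the second stage; that step is omitted here because it depends on STFT settings the abstract does not give.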
