Singing Voice Separation and Vocal F0 Estimation Based on Mutual Combination of Robust Principal Component Analysis and Subharmonic Summation

This paper presents a singing voice analysis method that performs singing voice separation and vocal fundamental frequency (F0) estimation in a mutually dependent manner. Vocal F0 estimation becomes easier when the singing voice can be separated from a music audio signal, and vocal F0 contours are in turn useful for singing voice separation. This motivates an approach that improves each task by using the results of the other. The proposed method first applies robust principal component analysis (RPCA) to roughly extract the singing voice from a target music audio signal. The F0 contour of the main melody is then estimated from the separated singing voice by finding the optimal temporal path over an F0 saliency spectrogram. Finally, the singing voice is separated again, more accurately, by combining the conventional time-frequency mask given by RPCA with another mask that passes only the harmonic structures of the estimated F0s. Experimental results showed that the proposed method significantly improved the performance of both singing voice separation and vocal F0 estimation. The proposed method also outperformed all the other singing voice separation methods submitted to MIREX 2014, an international music analysis competition.
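
The second and third stages described above (path search over an F0 saliency spectrogram, then harmonic masking) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the linear-frequency nearest-bin harmonic sum standing in for subharmonic summation, the decay factor, the jump penalty, the mask width, and the logical AND used to combine the masks. The abstract does not specify any of these details, and the RPCA stage is assumed to have already produced a magnitude spectrogram and a binary mask.

```python
import numpy as np

def shs_saliency(mag, freqs, f0_grid, n_harm=5, decay=0.84):
    """Simplified subharmonic-summation-style saliency (assumed form).

    For each candidate F0, sum the magnitudes at the bins nearest to its
    first n_harm harmonics, weighting harmonic h by decay**(h - 1).
    mag has shape (n_bins, n_frames); the result is (n_f0, n_frames).
    """
    sal = np.zeros((len(f0_grid), mag.shape[1]))
    for i, f0 in enumerate(f0_grid):
        for h in range(1, n_harm + 1):
            if h * f0 > freqs[-1]:
                break
            k = int(np.argmin(np.abs(freqs - h * f0)))  # nearest bin
            sal[i] += decay ** (h - 1) * mag[k]
    return sal

def track_f0(sal, max_jump=2, penalty=0.1):
    """Viterbi-style search for the path of maximum cumulative saliency.

    Transitions are limited to max_jump grid steps per frame and cost
    penalty per step (both values are illustrative assumptions).
    Returns one index into the F0 grid per frame.
    """
    n_f0, n_frames = sal.shape
    score = sal[:, 0].copy()
    back = np.zeros((n_f0, n_frames), dtype=int)
    for t in range(1, n_frames):
        new_score = np.empty(n_f0)
        for i in range(n_f0):
            lo, hi = max(0, i - max_jump), min(n_f0, i + max_jump + 1)
            cand = score[lo:hi] - penalty * np.abs(np.arange(lo, hi) - i)
            j = int(np.argmax(cand))
            new_score[i] = cand[j] + sal[i, t]
            back[i, t] = lo + j
        score = new_score
    # Backtrack from the best final state.
    path = np.zeros(n_frames, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[path[t], t]
    return path

def harmonic_mask(f0_track, freqs, n_harm=5, width_hz=20.0):
    """Binary mask that passes bins within width_hz of each harmonic."""
    mask = np.zeros((len(freqs), len(f0_track)), dtype=bool)
    for t, f0 in enumerate(f0_track):
        for h in range(1, n_harm + 1):
            mask[:, t] |= np.abs(freqs - h * f0) <= width_hz
    return mask

def combine_masks(rpca_mask, harm_mask):
    """Keep only bins passed by both masks (logical AND is an assumption)."""
    return rpca_mask & harm_mask
```

In this sketch, `shs_saliency` would be applied to the magnitude spectrogram of the roughly separated voice, `track_f0` would pick one F0 candidate per frame, and `combine_masks` would refine the RPCA mask with the harmonic mask before resynthesis.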
