Singing voice analysis and editing based on mutually dependent F0 estimation and source separation

This paper presents a novel framework that improves both vocal fundamental frequency (F0) estimation and singing voice separation by exploiting the mutual dependency between the two tasks. A typical approach to singing voice separation is to estimate the vocal F0 contour from a target music signal and then extract the singing voice with a time-frequency mask that passes only the harmonic components of the vocal F0s and their overtones. Conversely, vocal F0 estimation becomes easier if the singing voice can first be extracted accurately from the target signal. This mutual dependency has received little attention in conventional studies. To overcome this limitation, our framework alternates between the two tasks, using the result of each in the other. More specifically, we first extract the singing voice by using robust principal component analysis (RPCA). The F0 contour is then estimated from the separated singing voice by finding the optimal path over an F0-saliency spectrogram based on subharmonic summation (SHS). The estimated contour in turn lets us improve singing voice separation by combining a time-frequency mask based on RPCA with a mask based on harmonic structures. Experiments in which the proposed technique was used to directly edit vocal F0s in popular-music audio signals showed that it significantly improved both vocal F0 estimation and singing voice separation.
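The abstract does not give implementation details, so the following Python sketch is only an illustration of the second half of the pipeline (SHS saliency, optimal-path F0 tracking, and mask combination), not the authors' implementation. It makes several assumptions: a linear-frequency magnitude spectrogram with nearest-bin harmonic lookup instead of a log-frequency representation, a fixed harmonic decay of 0.84, a simple linear jump penalty for the path search, and a logical AND to combine the masks. RPCA itself is not implemented here; `rpca_mask` is a placeholder for a mask produced by an RPCA step, and all function and parameter names are hypothetical.

```python
import numpy as np

def shs_saliency(spec, freqs, f0_grid, n_harm=5, decay=0.84):
    """Subharmonic summation: weighted sum of magnitudes at h * f0.

    spec: (n_bins, n_frames) magnitude spectrogram, freqs: bin centers in Hz.
    Returns a (len(f0_grid), n_frames) F0-saliency spectrogram.
    """
    sal = np.zeros((len(f0_grid), spec.shape[1]))
    for h in range(1, n_harm + 1):
        target = h * f0_grid
        # Nearest spectrogram bin for the h-th harmonic of every candidate.
        bins = np.abs(freqs[None, :] - target[:, None]).argmin(axis=1)
        valid = target <= freqs[-1]  # skip harmonics above the top bin
        sal[valid] += decay ** (h - 1) * spec[bins[valid]]
    return sal

def viterbi_path(saliency, jump_penalty=0.5):
    """Path of F0-grid indices maximizing log-saliency minus jump costs."""
    n_f0, n_frames = saliency.shape
    log_s = np.log(saliency + 1e-12)
    idx = np.arange(n_f0)
    # trans[prev, cur]: assumed cost proportional to the jump in grid steps.
    trans = -jump_penalty * np.abs(idx[:, None] - idx[None, :])
    score = log_s[:, 0].copy()
    back = np.zeros((n_f0, n_frames), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + trans
        back[:, t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_s[:, t]
    path = np.empty(n_frames, dtype=int)
    path[-1] = score.argmax()
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[path[t], t]
    return path

def harmonic_mask(f0_track, freqs, n_harm=5, half_width_hz=15.0):
    """Binary mask passing bins close to each harmonic of the tracked F0."""
    mask = np.zeros((len(freqs), len(f0_track)), dtype=bool)
    for t, f0 in enumerate(f0_track):
        for h in range(1, n_harm + 1):
            mask[np.abs(freqs - h * f0) <= half_width_hz, t] = True
    return mask

# Synthetic example: a steady 220 Hz tone with five harmonics.
freqs = np.arange(0.0, 4000.0, 10.0)
spec = np.full((len(freqs), 20), 1e-3)
for h in range(1, 6):
    spec[int(round(220.0 * h / 10.0)), :] = 1.0 / h

f0_grid = np.arange(100.0, 500.0, 10.0)
f0_track = f0_grid[viterbi_path(shs_saliency(spec, freqs, f0_grid))]

# Placeholder for the RPCA-based mask (assumed to come from a separate step).
rpca_mask = np.ones(spec.shape, dtype=bool)
vocal_mask = rpca_mask & harmonic_mask(f0_track, freqs)
```

In a full system the spectrogram passed to `shs_saliency` would be the RPCA-separated vocal estimate rather than a synthetic tone, and `vocal_mask` would be applied to the mixture spectrogram before resynthesis.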
