Multi-Frame Amplitude Envelope Estimation for Modification of Singing Voice

Singing voice synthesis benefits from very high-quality estimation of the resonances and anti-resonances of the vocal tract filter (VTF), i.e., an amplitude spectral envelope. In the current state of the art, a single DFT frame is commonly used as the basis for building spectral envelopes. Although multi-frame analysis (MFA) has already been suggested for envelope estimation, it is not yet used in concrete applications. Indeed, even though existing attempts have shown very promising results, we demonstrate that they are either overly complicated or fail to reach the high accuracy required for singing voice. To enable future applications of MFA, this article aims to improve the theoretical understanding of MFA-based methods and to clarify their advantages. Singing voice signals are particularly well suited to studying MFA methods because the VTF configuration can be relatively stable while, at the same time, the vibrato creates a regular pitch variation that is easy to model. By simplifying and extending previous work, we also propose and describe two MFA-based methods. To better understand the behavior of the envelope estimates, we designed numerical measurements that assess single-frame analysis and MFA methods on synthetic signals. Using listening tests, we also designed two proofs of concept based on pitch scaling and timbre conversion. Both evaluations show clear, positive results for MFA-based methods, encouraging this research direction for future applications.
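To illustrate the core idea behind MFA-based envelope estimation, the following Python sketch (not the authors' algorithm) pools the harmonic (frequency, log-amplitude) samples from several frames whose F0 varies slightly, as it does under vibrato, and fits a smooth envelope to the pooled samples with a simple cosine-basis (discrete-cepstrum-like) least-squares fit. The function names, the toy "true" envelope, the F0 value, and the basis order are illustrative assumptions, not values from the article.

```python
import numpy as np

def fit_cepstral_envelope(freqs_hz, log_amps, order, fs):
    """Least-squares fit of a truncated cosine (cepstral-like) basis to
    sparse (frequency, log-amplitude) samples of the spectral envelope."""
    f = np.asarray(freqs_hz, dtype=float)
    basis = np.cos(2.0 * np.pi * np.outer(f, np.arange(order + 1)) / fs)
    basis[:, 1:] *= 2.0  # conventional factor of 2 on the k >= 1 terms
    coeffs, *_ = np.linalg.lstsq(basis, np.asarray(log_amps, dtype=float), rcond=None)
    return coeffs

def eval_envelope(coeffs, freqs_hz, fs):
    """Evaluate the fitted log-amplitude envelope on arbitrary frequencies."""
    f = np.asarray(freqs_hz, dtype=float)
    basis = np.cos(2.0 * np.pi * np.outer(f, np.arange(len(coeffs))) / fs)
    basis[:, 1:] *= 2.0
    return basis @ coeffs

# --- toy demonstration (all numbers are illustrative assumptions) -----------
fs = 16000.0
f0_mean = 400.0                      # fairly high F0: sparse harmonic sampling
grid = np.linspace(0.0, fs / 2.0, 512)

def true_log_envelope(f):
    # Hypothetical smooth envelope: two broad resonances as stand-ins for formants.
    return -0.5 * ((f - 700.0) / 250.0) ** 2 - 0.6 * ((f - 2300.0) / 400.0) ** 2

# Single-frame observation: harmonics of one F0 only.
h = np.arange(1, int((fs / 2.0) // f0_mean) + 1)
f_single = h * f0_mean
env_single = fit_cepstral_envelope(f_single, true_log_envelope(f_single), order=12, fs=fs)

# Multi-frame observation: pool harmonics from frames whose F0 wobbles (vibrato),
# which samples the envelope on a much denser frequency grid.
f_multi = np.concatenate([h * (f0_mean * (1.0 + d)) for d in np.linspace(-0.03, 0.03, 7)])
f_multi = f_multi[f_multi < fs / 2.0]
env_multi = fit_cepstral_envelope(f_multi, true_log_envelope(f_multi), order=12, fs=fs)

err_single = np.mean((eval_envelope(env_single, grid, fs) - true_log_envelope(grid)) ** 2)
err_multi = np.mean((eval_envelope(env_multi, grid, fs) - true_log_envelope(grid)) ** 2)
print(f"mean squared log-envelope error: single-frame {err_single:.4f}, multi-frame {err_multi:.4f}")
```

Because each frame's F0 offset shifts the harmonic comb, the pooled samples cover the frequency axis far more densely than any single frame can; this denser sampling of the VTF between the harmonics of one frame is the property that MFA-based estimators exploit.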
