Simple multi-frame analysis methods for amplitude spectral envelope estimation in singing voice

In the state of the art, amplitude spectral envelopes are commonly built from a single DFT frame. Multiple Frame Analysis (MFA) has already been suggested for envelope estimation, but often at the cost of excessive complexity. In this paper, two MFA-based methods are presented: one simplifies an existing Least Squares (LS) solution, and the other is based on simple linear interpolation. In the context of singing voice, we study sustained segments with vibrato, because they are critical for singing voice synthesis. They also provide a convenient setting to study before extending this work to more general contexts. Numerical and perceptual experiments show clear improvements of the two described methods over the state of the art and encourage further study in this research direction.
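To make the multi-frame idea concrete, the following is a minimal sketch and not the algorithm evaluated in the paper. It assumes that harmonic peaks (frequency, amplitude in dB) have already been measured in each frame of a sustained note with vibrato, pools them across frames, and linearly interpolates between the pooled points. The toy envelope, the function names, and all parameter values (16 frames, a 400 Hz fundamental, a vibrato extent of one semitone) are hypothetical choices made only for illustration.

```python
import numpy as np

def true_envelope_db(f):
    """Toy two-formant envelope in dB (an assumption made for this demo)."""
    f = np.asarray(f, dtype=float)
    return (20.0 * np.exp(-0.5 * ((f - 700.0) / 150.0) ** 2)
            + 10.0 * np.exp(-0.5 * ((f - 1800.0) / 200.0) ** 2))

def harmonic_peaks(f0, n_harm):
    """Harmonic frequencies of one frame and the envelope sampled at them."""
    freqs = f0 * np.arange(1, n_harm + 1)
    return freqs, true_envelope_db(freqs)

def pooled_linear_envelope(frames, grid):
    """Pool (freq, amp_db) peaks from several frames and interpolate linearly.

    frames: list of (freqs, amps_db) array pairs, one pair per frame.
    grid:   frequencies (Hz) at which the envelope is evaluated.
    """
    freqs = np.concatenate([f for f, _ in frames])
    amps = np.concatenate([a for _, a in frames])
    # np.interp needs increasing sample points, so sort the pooled peaks
    # and average the amplitudes of peaks that share exactly one frequency.
    uniq, inverse = np.unique(freqs, return_inverse=True)
    mean_amps = np.bincount(inverse, weights=amps) / np.bincount(inverse)
    return np.interp(grid, uniq, mean_amps)

# One vibrato cycle: f0 oscillates by +/- one semitone around 400 Hz.
n_frames, n_harm = 16, 10
phase = 2.0 * np.pi * np.arange(n_frames) / n_frames
f0s = 400.0 * 2.0 ** (np.sin(phase) / 12.0)
frames = [harmonic_peaks(f0, n_harm) for f0 in f0s]

grid = np.linspace(400.0, 4000.0, 500)
single = pooled_linear_envelope(frames[:1], grid)  # first frame only
multi = pooled_linear_envelope(frames, grid)       # all frames pooled

reference = true_envelope_db(grid)
err_single = np.mean(np.abs(single - reference))
err_multi = np.mean(np.abs(multi - reference))
print(f"mean abs error, single frame: {err_single:.2f} dB")
print(f"mean abs error, multi frame:  {err_multi:.2f} dB")
```

In this toy setting the vibrato moves the harmonics across the envelope, so the pooled peaks sample it more densely than any single frame does. With measured data, nearly coincident peaks from different frames can carry different amplitudes, so some smoothing or averaging over small frequency bins would likely be needed before interpolating.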
