Revisiting spectral envelope recovery from speech sounds generated by periodic excitation

We propose a set of new, accurate spectral envelope recovery methods for speech sounds generated by periodic excitation, based on a set of interference-free power spectrum representations. The proposed methods outperform our previous spectral recovery models used in the legacy-STRAIGHT, TANDEM-STRAIGHT, and WORLD VOCODERs. We introduce several design procedures for paired time windows that remove interference caused by signal periodicity in the time domain or in both the time and frequency domains. In addition to this interference-free representation, we introduce pre- and post-processing to improve recovery accuracy around spectral peak regions. We conducted a set of evaluation tests using a voice production simulator and natural speech samples. Finally, we discuss the application of the proposed methods to revising high-quality VOCODERs.
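As a rough illustration of what removing periodicity interference in the time domain means, the sketch below averages two power spectra whose windows are placed half a fundamental period apart, which is the basic idea behind TANDEM-STRAIGHT. It is not the window design proposed here. The impulse-train excitation, the period of 100 samples, the Hann window spanning two periods, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def power_spectrum(x, start, window):
    """Power spectrum of the windowed segment beginning at sample `start`."""
    segment = x[start:start + len(window)] * window
    return np.abs(np.fft.rfft(segment)) ** 2

def paired_power_spectrum(x, start, window, period):
    """Average of two power spectra taken half a period apart (TANDEM-style)."""
    half = period // 2
    return 0.5 * (power_spectrum(x, start, window)
                  + power_spectrum(x, start + half, window))

# Hypothetical test signal: an impulse train with a period of 100 samples.
period = 100
x = np.zeros(20 * period)
x[::period] = 1.0

# Periodic Hann window spanning two fundamental periods (an assumed choice).
n = np.arange(2 * period)
window = 0.5 - 0.5 * np.cos(2 * np.pi * n / (2 * period))

# Slide the analysis position across one period and record both spectra.
starts = range(0, period, 5)
single = np.array([power_spectrum(x, s, window) for s in starts])
paired = np.array([paired_power_spectrum(x, s, window, period) for s in starts])

# Largest temporal standard deviation over all frequency bins.
single_variation = single.std(axis=0).max()
paired_variation = paired.std(axis=0).max()
print("single-window variation:", single_variation)
print("paired-window variation:", paired_variation)
```

For this idealized excitation the paired spectrum no longer depends on where the windows fall within the period, whereas the single-window spectrum fluctuates strongly between harmonics. For natural speech the cancellation would only be approximate.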
