Modeling formant dynamics in speech spectral envelopes

The spectral envelope of a speech signal encodes information about the characteristics of the speech source. Spectral envelope modeling is therefore a central task in speech applications, where tracking temporal transitions in diphones and triphones is essential for efficient speech synthesis and recognition algorithms. Temporal changes in the envelope structure are often derived from estimated formant tracks, an approach that is sensitive to estimation errors. In this paper we propose a speech source model that estimates frequency and amplitude movements in the spectral envelopes of speech signals without relying on formant tracking. The proposed model estimates the amplitude and frequency shifts for each sub-band and time frame of a speech signal using information from the previous time frame. Our experiments demonstrate that the model captures the temporal structure of spectral envelopes with high precision. The proposed model can thus serve as an accurate low-order representation of temporal dynamics in speech spectral envelopes.
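To make the per-sub-band, frame-to-frame idea concrete, the following is a minimal illustrative sketch (not the paper's actual model): it divides each frame's magnitude spectrum into uniform sub-bands and takes the change in log band energy as the amplitude shift and the change in the band's spectral centroid as the frequency shift, each measured relative to the previous frame. The function name, uniform band layout, and centroid-based frequency estimate are all assumptions made for illustration.

```python
import numpy as np

def subband_shifts(frames, sr, n_bands=8):
    """Illustrative estimate of per-sub-band amplitude and frequency
    shifts between consecutive frames (assumed sketch, not the paper's model).

    frames : (n_frames, frame_len) windowed time-domain frames
    sr     : sampling rate in Hz
    Returns (amp_shift, freq_shift), each of shape (n_frames - 1, n_bands).
    """
    spec = np.abs(np.fft.rfft(frames, axis=1))            # magnitude spectra
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)  # bin frequencies (Hz)
    n_bins = spec.shape[1]
    edges = np.linspace(0, n_bins, n_bands + 1, dtype=int)  # uniform band edges

    amps = np.empty((spec.shape[0], n_bands))
    cents = np.empty_like(amps)
    for b in range(n_bands):
        band = spec[:, edges[b]:edges[b + 1]]
        f = freqs[edges[b]:edges[b + 1]]
        energy = band.sum(axis=1) + 1e-12                 # avoid log(0)
        amps[:, b] = np.log(energy)                       # log band energy
        cents[:, b] = (band * f).sum(axis=1) / energy     # band centroid (Hz)

    # shift = change relative to the previous time frame
    return np.diff(amps, axis=0), np.diff(cents, axis=0)
```

For a roughly stationary input, both shift matrices stay near zero; for speech, they trace the amplitude and frequency movements of the envelope in each band.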
