Analysis and HMM-based synthesis of hypo and hyperarticulated speech

Hypo- and hyperarticulation refer to the production of speech with, respectively, a reduction and an increase in articulatory effort compared to the neutral style. Produced consciously or not, these variations in articulatory effort depend on the surrounding environment, the communication context and the speaker's motivation with regard to the listener. The goal of this work is to integrate hypo- and hyperarticulation into speech synthesizers, making them more realistic by automatically adapting their way of speaking to the contextual situation, as humans do. Based on our preliminary work, this paper provides a thorough and detailed study of the analysis and synthesis of hypo- and hyperarticulated speech. It is divided into three parts. The first focuses on both the acoustic and phonetic modifications caused by changes in articulatory effort. The second part aims at developing an HMM-based speech synthesizer that allows continuous control of the degree of articulation. This first requires tackling the issue of speaking-style adaptation, so as to derive hypo- and hyperarticulated speech from the neutral synthesizer. Once this is done, interpolation and extrapolation of the resulting models make it possible to finely tune the voice so that it is generated with the desired articulatory effort. Finally, the third and last part presents a perceptual study of speech with a variable degree of articulation, analyzing how intelligibility and various other voice dimensions are affected.
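To make the interpolation and extrapolation step more concrete, the sketch below illustrates one common way such continuous control can be realized: linearly combining the Gaussian mean parameters of the neutral model and of a style-adapted (hypo- or hyperarticulated) model with a weight alpha, where values outside [0, 1] correspond to extrapolation. This is a minimal illustrative sketch under that assumption, not the system described in the paper; the function and variable names are hypothetical.

```python
import numpy as np

def interpolate_state_means(mu_neutral, mu_styled, alpha):
    """Hypothetical sketch: linearly combine the mean vectors of one HMM
    state from a neutral model and a style-adapted model.

    alpha = 0.0  -> neutral voice
    alpha = 1.0  -> fully hypo-/hyperarticulated voice
    alpha < 0 or alpha > 1 -> extrapolation beyond the recorded styles
    """
    mu_neutral = np.asarray(mu_neutral, dtype=float)
    mu_styled = np.asarray(mu_styled, dtype=float)
    return (1.0 - alpha) * mu_neutral + alpha * mu_styled

# Example with made-up mean vectors (e.g. mel-cepstral coefficients of one state)
neutral_mean = [0.12, -0.40, 0.05]
hyper_mean = [0.20, -0.55, 0.10]

print(interpolate_state_means(neutral_mean, hyper_mean, 0.5))  # mild hyperarticulation
print(interpolate_state_means(neutral_mean, hyper_mean, 1.5))  # extrapolated, exaggerated style
```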
