Conversational speech synthesis and the need for some laughter

This paper reports progress in the synthesis of conversational speech, from the viewpoint of work carried out on the analysis of a very large corpus of expressive speech collected in normal everyday situations. With recent developments in concatenative techniques, speech synthesis has overcome the barrier of realistically portraying extra-linguistic information by using the actual voice of a recognizable person as the source of units, combined with minimal signal processing. However, the technology still faces the problem of expressing paralinguistic information, i.e., the variety of speaking styles and laughter that a person uses in everyday social interactions. Paralinguistic modification of an utterance portrays the speaker's affective states and signals his or her relationship with the listener through variations in the manner of speaking, by means of prosody and voice quality. These inflections are carried on the propositional content of an utterance and can perhaps be modeled by rule, but they are also expressed through nonverbal utterances, whose complexity may be beyond the capabilities of many current synthesis methods. We suggest that this problem may be solved by the use of phrase-sized utterance units taken intact from a large corpus.
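The proposal above, selecting whole phrase-sized units from a labeled corpus so that their recorded prosody and voice quality carry the paralinguistic information intact, can be sketched as a toy selection routine. This is only an illustrative assumption of how such a lookup might work: the class, function, label names, and file paths below are hypothetical and do not describe the paper's actual system.

```python
# Toy sketch of phrase-sized unit selection from a labeled corpus.
# All names here (PhraseUnit, select_phrase, the label values) are
# illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class PhraseUnit:
    text: str        # propositional content of the phrase
    speech_act: str  # e.g. "greeting", "backchannel", "laugh"
    affect: str      # coarse affective label, e.g. "friendly"
    audio_file: str  # path to the intact recorded waveform


CORPUS = [
    PhraseUnit("hello there", "greeting", "friendly", "u001.wav"),
    PhraseUnit("hello", "greeting", "neutral", "u002.wav"),
    PhraseUnit("(laughter)", "laugh", "amused", "u003.wav"),
]


def select_phrase(speech_act: str, affect: str) -> PhraseUnit:
    """Return the corpus unit whose labels best match the target.

    Scoring is a simple count of matching labels; a real system would
    score prosodic and voice-quality features and add a join cost.
    """
    def score(u: PhraseUnit) -> int:
        return int(u.speech_act == speech_act) + int(u.affect == affect)
    return max(CORPUS, key=score)


unit = select_phrase("greeting", "friendly")
print(unit.audio_file)  # the selected phrase is replayed intact
```

Because the chosen unit is played back whole rather than assembled from sub-phonemic pieces, its laughter, voice quality, and prosody require no signal processing, which is the motivation given in the abstract for phrase-sized units.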
