DATABASES OF EMOTIONAL SPEECH

This paper presents a personal view of some of the problems facing speech technologists in the study of emotional speech. It describes some databases that are currently in use and points out that the majority of them rely on actors to reproduce the emotions, thereby possibly misrepresenting the true characteristics of emotion in speech. Databases of real emotional speech, on the other hand, present serious ethical and moral problems, since the nature of their contents must, by definition, reveal personal and intimate details about the speakers.

1. OBJECTIVES OF THE PAPER

This paper does not set out to provide an inventory of databases available for the study of emotional speech characteristics; the JASA paper by Murray and Arnott [1], the PHYSTA home page [2], and the web pages of Erlangen University and the Salk Institute [3], for example, provide good overviews of such previous work. Instead, the paper presents a personal account of some of the problems facing researchers who wish to study the speech characteristics associated with different emotions. It approaches the issue from the standpoint of speech technology, rather than that of psychology, and describes work planned under a forthcoming JST-funded five-year project for the study of 'expressive speech phenomena', which will include the production of a large-scale emotional-speech database. Rather than present new facts or data, the paper sets out some topics for discussion and raises questions, in the hope that some of the issues may be resolved during the three days of the workshop.

2. EXPRESSIVE SPEECH

Linguistic information is all that can be carried by text, but it is only a small part of the spoken message. As humans, when listening to speech, we are sensitive to extra-linguistic information about the identity and the state of the speaker, as well as to paralinguistic information about the speaker's intentions underlying the utterance.
This information is largely missing from computer speech synthesis, and current speech recognition systems make no use of it. In many instances of conversational human communication, the speaker's intention, signalled by the manner of speech, is as important as the text of the utterance; in social or phatic communication, it is often more so. As humans, we have become used to processing such extra-verbal information and will presumably expect it when interacting with machines through the medium of voice.