Computational models of prosody in the Nguni languages

Abstract

We investigate two related issues in the computational modeling of Nguni prosody, based on annotated databases of isiZulu and isiXhosa speech. Firstly, we show that a simple template can be used to describe the tonal characteristics of vowels and adjectives spoken in isolation, and that contextual effects have only a mild impact on this template. This analysis was based on a simple mapping between pitch and tone; in the second part of the paper, we show that pitch and amplitude actually play comparable roles in producing tonal percepts.

1. Prosodic models in multilingual speech technology

Although the complexity of prosody is widely recognized [8], the lack of widely accepted descriptive standards for prosodic phenomena has meant that prosodic systems for most of the languages of the world have, at best, been described in impressionistic rule-based terms. This situation has become particularly noticeable with the development of increasingly capable text-to-speech (TTS) systems [2]. Such systems require detailed prosodic models to sound natural, and the development of these detailed models poses a significant challenge to the descriptive systems employed for prosodic quantities. For languages such as English or Japanese, for example, the ToBI marking system [1] has gained a significant following because of its utility in producing predictions for these quantities. These models allow developers to employ the methods of pattern recognition to compute numerical targets for the fundamental frequency and amplitude of spoken utterances, based on their written representation.

For the languages of Southern Africa, the deficiencies in our modeling capabilities are acute. It has long been recognized that, for example, the languages of the Nguni family (such as isiZulu and isiXhosa) have an intricate tonal structure; in fact, the adequate description of this structure was one of the major early successes of autosegmental phonology.
However, little work of a quantitative nature has been published, and as Roux [11] points out, there are significant contradictions and imprecisions in the literature on this topic, which partially stem from the lack of quantitative, measurement-driven analysis.

In the current paper we detail initial results from a program that we have initiated in order to develop detailed, reliable intonation models for the languages of Southern Africa. In particular, we discuss various measurements that have been obtained in order to model the fundamental frequency contours of isiZulu and isiXhosa, and report on initial investigations into the relationship between pitch, intensity, and lexical tone.

A wide-ranging overview of intonation in numerous languages is provided in [8]; here, we briefly review some of the facts pertinent to our investigations, partly to fix terminology, since there is no universal agreement on the semantics of this domain. We use the terms prosody and intonation interchangeably to refer to the melodic pattern of an utterance. In other words, it is the non-phonetic content of speech; at the linguistic level, this is represented by variables related to tone, stress and rhythm. These variables are either attached to specific words, in which case they are called lexical quantities, or to (generally) larger units, in which case they are tagged as supralexical or syntactic. Corresponding to these linguistic variables are a number of physically measurable quantities, most noticeably fundamental frequency, intensity and duration. Although fundamental frequency generally is most strongly correlated with tone, intensity with stress, and duration with rhythm, this correspondence is far from perfect. Thus, stress may be indicated with changes in fundamental frequency or duration as well.
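
To make the three physically measurable quantities concrete, the following is a minimal sketch of how fundamental frequency, intensity and duration might be estimated for a single voiced segment. It is an illustration only and not the measurement procedure used in this paper: the function name `segment_measurements`, the autocorrelation-based pitch estimate, the 75 to 400 Hz search range, and the use of RMS level in dB relative to an amplitude of 1.0 are all assumptions introduced here.

```python
import numpy as np


def segment_measurements(signal, sample_rate, f0_min=75.0, f0_max=400.0):
    """Estimate (f0_hz, intensity_db, duration_s) for one voiced segment.

    Hypothetical helper for illustration; assumes the segment is voiced,
    roughly stationary, and spans at least two pitch periods.
    """
    x = np.asarray(signal, dtype=float)

    # Duration: number of samples divided by the sampling rate.
    duration_s = len(x) / sample_rate

    # Intensity: RMS level in dB relative to an amplitude of 1.0.
    rms = np.sqrt(np.mean(x ** 2))
    intensity_db = 20.0 * np.log10(max(rms, 1e-12))

    # Fundamental frequency: strongest autocorrelation peak among the
    # lags that correspond to the assumed f0_min..f0_max search range.
    x0 = x - x.mean()
    ac = np.correlate(x0, x0, mode="full")[len(x0) - 1:]
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(ac) - 1)
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    f0_hz = sample_rate / best_lag

    return f0_hz, intensity_db, duration_s


if __name__ == "__main__":
    # Synthetic 200 Hz tone, 50 ms long, sampled at 16 kHz.
    rate = 16000
    t = np.arange(800) / rate
    tone = 0.5 * np.sin(2 * np.pi * 200.0 * t)
    print(segment_measurements(tone, rate))
```

In a sketch of this kind, each syllable or vowel would yield one triple of values, and it is these triples that could then be compared against the tone, stress and rhythm labels discussed above. Real speech would additionally call for voicing detection and protection against octave errors, which are omitted here.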