Recognition of Prosodic Categories in Swedish: Rule Implementation

Descriptive rules for recognition of prosodic categories in Swedish are currently being implemented in an automatic prosody recognition scheme. An algorithm is described in which the speech signal is segmented into syllables (tonal segments) using intensity measurements and fundamental frequency. Each syllable is then given six values related to fundamental frequency and duration. The values for each syllable are tested against conditions which describe the prosodic categories. The category attaining the highest score is assigned to the syllable. Preliminary results for two sets of rule conditions for ten test sentences are presented.

INTRODUCTION

This paper is a status report from an ongoing joint research project shared by the Phonetics Departments at the Universities of Lund and Stockholm. The project, "Prosodic Parsing for Swedish Speech Recognition", is sponsored by the National Swedish Board for Technical Development and is part of the National Swedish Speech Recognition Effort in Speech Technology.

The primary goal of the project is to develop a method for extracting relevant prosodic information from a speech signal. We hope to devise a system which, given a speech signal as input, will provide us with a transcription showing syllabification of the utterance, categorization of the syllables into STRESSED and UNSTRESSED, categorization of the stressed syllables into WORD ACCENTS (ACUTE and GRAVE) and categorization of the word accents into FOCAL and NONFOCAL accents. We also hope to be able to identify JUNCTURE (connective and boundary signals for phrases). We are currently working with 20 prosodically varied sentences spoken by two speakers of Stockholm Swedish.

The type and structure of the information to be presented to the recognizer has been based on a series of mingogram reading experiments (see House et al. 1987a, 1987b).
In the first experiment, an expert in Swedish prosody (Gösta Bruce) was presented with mingogram representations of ten unknown sentences showing a duplex oscillogram, fundamental frequency contour and intensity curve. On the basis of this information, he was able to identify 85% of all occurrences of the prosodic categories referred to above. Descriptive rules were then formulated and tested using two non-expert mingogram readers. Their scores were 78% and 69%.

* At Stockholm University, Department of Linguistics and Phonetics

DAVID HOUSE, GÖSTA BRUCE, LARS ERIKSSON AND FRANCISCO LACERDA

Our scheme for automatic prosodic recognition can be broken down into three main steps (see Figure 1). First, intensity and fundamental frequency are extracted from the digitized signal. Second, intensity relationships and fundamental frequency information are used to automatically segment the utterance into "tonal segments" which ideally correspond to syllabic units. Third, the prosody recognition rules are applied to these tonal segments, giving us prosodic categories as the output of the system. The system is being developed for use on an IBM-AT. Current testing of the segmentation algorithm, however, has been carried out using the ILS signal-processing package on a VAX 11/730.

[Figure 1 block labels: Speech; Fo, INT; Autoseg; Rules; Categories]

Figure 1. The main components of the prosody recognition scheme.

AUTOMATIC SEGMENTATION

The automatic segmentation component of the recognition scheme has been designed using intensity measurements in much the same way as that described by Mertens 1987. Similar algorithms have been described by Mermelstein 1975, Lea 1980, and Blomberg and Elenius 1985. The speech signal is first low-pass filtered at 4 kHz (anti-aliasing) and sampled at 10 kHz. An intensity curve is obtained from this signal using the RMS intensity parameter in the ILS program package. This curve is referred to as the unfiltered intensity curve.
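The unfiltered intensity curve described above can be illustrated with a minimal sketch. It is not the ILS implementation: the function name `rms_intensity_db`, the 20 ms frame length, the 10 ms hop and the dB floor are all assumptions made for illustration, since the paper does not give the analysis settings. Only the 10 kHz sampling rate comes from the text.

```python
import math

def rms_intensity_db(samples, frame_len=200, hop=100, floor=1e-10):
    """Return one RMS level in dB per analysis frame.

    frame_len and hop are assumed values (20 ms and 10 ms at 10 kHz);
    the paper does not state the settings used in ILS.
    """
    curve = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        rms = math.sqrt(mean_square)
        # The floor avoids log10(0) on silent frames.
        curve.append(20.0 * math.log10(max(rms, floor)))
    return curve

# Synthetic test signal: a 100 Hz sine sampled at 10 kHz,
# full amplitude for 0.5 s and then one tenth of that for 0.5 s.
fs = 10_000
signal = [
    (1.0 if n < 5000 else 0.1) * math.sin(2 * math.pi * 100 * n / fs)
    for n in range(10_000)
]
curve = rms_intensity_db(signal)
```

On this synthetic signal the curve drops by about 20 dB between the loud and the quiet half, which is the kind of intensity contrast a syllable segmenter can exploit.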
Fundamental frequency is also extracted using a modified cepstral processing technique included in the ILS package. An additional intensity curve is obtained from a digital band-pass filtered version of the sampled signal (0.5-4 kHz, 72 dB/oct). This curve is referred to as the filtered intensity curve. Both intensity curves are smoothed (moving average). Figure 2 presents a graphic overview of the segmentation process, in which steps 1 and 2 represent the filtering, analysis and smoothing described above.

[Figure 2 block labels, partly illegible in the source: Speech signal; Band-pass; Analysis and smoothing; Syllabic segmentation]
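The moving-average smoothing applied to both intensity curves can be sketched as follows. The window width is not given in the paper, so `width` here is an assumed parameter, and the function name `moving_average` and the sample values are hypothetical. The band-pass filtering step is left out of this sketch.

```python
def moving_average(curve, width=5):
    """Smooth an intensity curve with a centred moving average.

    width is an assumed, odd number of frames. Near the ends the
    window is shortened so the output keeps the input length.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    smoothed = []
    for i in range(len(curve)):
        window = curve[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Hypothetical frame-level intensity values in dB with one spurious spike.
unfiltered = [60.0, 62.0, 75.0, 61.0, 60.0, 40.0, 41.0]
smoothed = moving_average(unfiltered, width=3)
```

A wider window removes more frame-to-frame jitter but also blurs the intensity dips between syllables, so in practice the width would have to be tuned against the segmentation results.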