APEX - an articulatory model for speech and singing

The APEX articulatory synthesis model is being developed as a joint project at the Department of Speech, Music and Hearing at the Royal Institute of Technology and at the Department of Linguistics at Stockholm University. It is a direct development of an earlier vowel model [1], implemented as a computer program under Windows [2]. It calculates formants and produces sound according to articulatory profiles from a virtual vocal tract, generates possible articulatory configurations within a specified articulatory space, and parameterizes and animates series of articulatory configurations. The default vocal tract is based on lateral X-ray data from an adult male speaker, complemented with frontal and mid-sagittal measures from a standard vocal tract. However, the model can be calibrated with, and run on, vocal tract data from any individual. The APEX model is used for testing and shedding light on theories of speech and singing production, both in general and for specific speakers or singers. It is primarily a research instrument, continually developed according to new findings and the needs of its users.

AIMS

Articulatory models ideally allow control of the positions of the articulators: the lower jaw, the lips, the root, body and tip of the tongue, the velum, and the larynx. If accurate, such models will produce realistic area functions. Articulatory models are similar to, but not equivalent to, area function models, where the control parameters are the location and degree of the tongue constriction plus the cross-sectional area of the lip opening; such models may also produce area functions unavailable to a real vocal tract.

Our work with an articulatory model started many years ago and was originally based on tracings of X-ray profiles of a Swedish subject who produced a dozen sustained vowel sounds [1]. We have now updated and expanded this model.
In its present form, called APEX, it runs as a PC program and produces sounds by means of a conventional sound card. The ultimate goal of our APEX model is to contribute to a better understanding of the function and potential of the human voice organ in speech and singing. A physiologically realistic model should offer an efficient tool for translating acoustic characteristics into articulatory gestures and settings, and vice versa. An articulatory model would also be a powerful pedagogical tool.

APPEARANCE

The APEX program runs under Windows and is written in C++ [2]. The display (Figure 1, upper panel) shows a virtual vocal tract (VT) profile, together with articulatory parameter regulators and function toolbars that allow variation of the position and shape of the model's lips, tongue tip, tongue body, jaw opening and larynx height. A coordinate system template defined with respect to fixed VT landmarks is then applied to this profile. This template is used for measuring sagittal distances along the VT midline at a number of points from the glottis to the lips. These distances are then converted into cross-sectional areas (Figure 1, lower panel) using anatomically motivated and speaker-dependent rules. The resulting area function is used for calculating the formant frequencies. Using the SENSYN speech synthesizer, sound examples of articulatory configurations can be obtained.

Figure 1. The APEX articulatory model. In the lower panel, the dark pattern represents the area function. The line refers to the VT cross-distances.

ARTICULATORY PARAMETERS AND CALIBRATION

To generate an articulatory profile, the shape and position of the fixed contours (mandible, maxilla, posterior pharyngeal wall and larynx) and the variable contours (tongue body, tongue blade and lips) are combined. The adjustable articulatory parameters are the following: the one-dimensional mandible, ranging from 0 to 25 mm along a curvilinear, empirically determined path; the larynx, which can be translated and rotated in the x/y plane; the lips, described in terms of two parameters: width and height; the tongue blade, created by a parabolic function attached to the tongue body and controlled by two parameters: protrusion (extension-retraction) and elevation (displacement from neutral); and the tongue body, specified in terms of two parameters: anterior-posterior position and displacement (deviation from neutral).

The geometry of the APEX vocal tract is based on articulatory X-ray data [3] selected from an articulatory database that contains 12 subjects. Using a digital X-ray technique, the subjects were recorded for 20 seconds each at a rate of 50 images/second. Audio signals were recorded synchronously.

To calibrate the APEX model with data from an individual speaker, the vocal tract contours from the vowels /i, #, u/ and a neutral ə-like vowel were traced on X-ray images, using the Osiris Imaging Software (Figure 2). The general strategy was to have the model represent these vowel configurations as faithfully as possible and then to derive intermediate articulations by physiologically motivated interpolation rules. All X-ray tracings were specified in the program by x/y coordinates and extracted as labeled lists, using a specially developed tool (Papex). The tracings were then rescaled to real mm and transferred to the origin of the coordinate system in APEX.

Figure 2. X-ray image with contour tracings of an articulation of the vowel [u:].

To allow for control of the jaw aperture, a path was derived by plotting the positions of mandible reference points versus the maxilla from selected articulations (/s, u, y, ə, o, #/). Using the equation of this plotted jaw function, the path was expanded by extrapolation.

The contours were stored as lists of plain x/y mm coordinates in a calibration text file, written in a syntax readable by APEX.
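
The rescaling of the tracings and the extrapolation of the jaw path described above can be illustrated with a minimal Python sketch. It is not the APEX or Papex code: the function names, the pixel-to-mm factor, the choice of origin landmark and, in particular, the quadratic form of the jaw function are all assumptions made for illustration, since the text does not specify the equation actually used.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def to_model_mm(points_px: List[Point], mm_per_px: float,
                origin_px: Point) -> List[Point]:
    """Rescale traced pixel coordinates to mm and shift them so that
    origin_px (a hypothetical reference landmark) becomes (0, 0)."""
    ox, oy = origin_px
    return [((x - ox) * mm_per_px, (y - oy) * mm_per_px)
            for x, y in points_px]


def _det3(a: List[List[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


def fit_quadratic(points: List[Point]) -> Tuple[float, float, float]:
    """Least-squares fit of y = a*x**2 + b*x + c.
    The quadratic form is an assumption, not the documented jaw function."""
    # Power sums for the normal equations.
    s = [sum(x ** k for x, _ in points) for k in range(5)]
    t = [sum(y * x ** k for x, y in points) for k in range(3)]
    m = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    d = _det3(m)
    if d == 0:
        raise ValueError("need at least three distinct x values")
    coeffs = []
    # Cramer's rule: replace one column at a time with the right-hand side.
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        coeffs.append(_det3(mc) / d)
    return coeffs[0], coeffs[1], coeffs[2]


def jaw_path_y(coeffs: Tuple[float, float, float], x: float) -> float:
    """Evaluate the fitted curve; x may lie outside the measured range,
    which is how the path is extended by extrapolation."""
    a, b, c = coeffs
    return a * x * x + b * x + c


if __name__ == "__main__":
    # Made-up tracing in pixels, 0.5 mm per pixel, landmark at (100, 200).
    print(to_model_mm([(110, 220), (130, 200)], 0.5, (100, 200)))
    # Made-up mandible reference points (mm) relative to the maxilla.
    measured = [(0.0, 2.0), (1.0, 1.5), (2.0, 2.0), (3.0, 3.5)]
    coeffs = fit_quadratic(measured)
    print(jaw_path_y(coeffs, 5.0))  # a point beyond the measured range
```

In this sketch the extrapolated curve would still have to be restricted to the 0 to 25 mm jaw range mentioned earlier; how APEX parameterizes position along the path is not described here, so that step is left out.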
The calibration file contained information about scaling, origin and subject identity, templates for the fixed and variable contours, the four selected reference vowels specified in terms of their tongue blade, tongue body, larynx position and jaw opening, and a description of the jaw path. The input file is loaded into the APEX program, which opens three mathematical models for calibration of the tongue body, tongue blade and larynx, respectively. Calibration is achieved by adjusting the control constants of these models so as to simulate the vocal tract configurations for each reference vowel. This is done in a graphical interface showing the observed and the modeled curve (Figure 3). The goodness of fit is displayed as the root mean square error for every adjustment. The calibration process involves modeling 12 tracings (four vowels with three parts each).

The class of all possible tongue bodies of the modeled speaker is generated by interpolating between the reference configurations. In essence, this interpolation reflects the action of the hyoglossus, the styloglossus and the genioglossus, the major muscular determinants of shape and position in vowel articulation. A more detailed account of the calibration of APEX can be found in Stark et al. [4] [5]. By adjusting the model parameters, the virtual subject's articulatory possibilities can be investigated and compared with those of the real speaker.

ACOUSTICAL CHARACTERISTICS

To obtain the acoustic correlate of a given articulatory profile, the first step is to determine the vocal tract 'midline'. A coordinate system template is positioned relative to anatomical landmarks (see Figure 1, upper panel). Lines perpendicular to this midline are used for measuring sagittal VT distances (d). These distances are then transformed into cross-sectional areas (A) according to