Read my tongue movements: bimodal learning to perceive and produce non-native speech /r/ and /l/

Abstract

This study investigated the effectiveness of Baldi, a computer-animated talking head, for teaching non-native phonetic contrasts by comparing instruction illustrating the internal articulatory processes of the oral cavity with instruction providing just the normal view of the tutor’s face. Eleven Japanese speakers of English as a second language were bimodally trained under both instruction methods to identify and produce American English /r/ and /l/ in a within-subject design. Speech identification and production improved under both training methods, although training with a view of the internal articulators did not show an additional benefit. A generalization test showed that this learning transferred to the production of new words.

1. Introduction

All humans have the unique ability to acquire the phonological system of a first language with ease; however, once that phonetic system is established, it is challenging to acquire the phonetic system of a subsequent language. Part of this difficulty stems from the fact that different languages utilize different subsets of phonetic contrasts and show subtle differences within the same phonetic category. Because there is not a universal mapping between phonological features and phonetic parameters (Lindau & Ladefoged, 1986), speech production and perception reflect strong influences of the phonological system of a person’s first language. One of the best-documented cases of difficulty with a second language is that of Japanese speakers with the English /r/ vs. /l/ contrast. This difficulty most likely reflects the lack of a contrast between /r/ and /l/ in Japanese phonology, which leads Japanese speakers to discriminate and produce the English /r-l/ contrast poorly. Numerous studies have shown that discrimination of non-native contrasts can be improved with auditory training (Lively, Logan & Pisoni, 1993; Werker & Logan, 1985).
Furthermore, Hardison (2002) found somewhat better learning of /r/ and /l/ by Japanese and Korean speakers when training involved a frontal view of the talker than when it involved auditory speech alone. The bimodal advantage for identification performance was larger for the more difficult test items. There was also an indication that bimodal training improved speech production more than auditory training, but it was not clear whether these differences were statistically significant.

Extending Hardison’s study, we test the hypothesis that both perception and production of these segments can be improved with bimodal speech training in which movement of the internal articulators is illustrated. Baldi, our computer-animated talking head aligned with auditory speech, is more capable than a human of demonstrating articulatory processes (Massaro, 1998). The skin of our talking head can be made transparent or eliminated so that the inside of the vocal tract is visible, or a cutaway view of the head along the sagittal plane can be shown (see Figure 1). The internal articulators can also be displayed from different vantage points so that the subtleties of articulation can be optimally visualized. In addition, the areas where the tongue contacts the palate and teeth can be highlighted by a change in color.

Figure 1. Two of the four presentation views in the internal-articulators-present (A) condition, giving a side view of Baldi when his skin was made transparent (left) and a side view of Baldi's tongue, teeth, and palate (right).

This study tests whether instruction revealing the internal articulatory processes of the oral cavity is more effective than instruction with just a normal frontal view of the tutor’s face. We hypothesize that the unique properties of our program will help Japanese native speakers perceive and produce English speech more accurately. Other issues that this report addresses include: 1) how learning occurs over the course of the study and 2) how learning differs for different minimal word pairs.

The training method employed on each day varied: internal articulators present (A) vs. no internal articulators present (NA). The order of training depended on the participant’s group (A-NA vs. NA-A). One group of 5 participants (A-NA) received training involving visible internal articulatory movements for the first half of the study (days 1-3), and no visible internal articulatory movements for the last half of the study (days 4-6); the other group of 6 participants (NA-A) received the opposite training sequence, with no visible internal articulation training during the first half of the study and visible internal articulation training for the last half of the study. We expected the A training method to give better performance accuracy than the NA training method for both