A hierarchical linguistic information-based model of English prosody: L2 data analysis and implications for computer-assisted language learning

Abstract: The paper presents a prosody model of native English (L1) continuous speech that serves as corrective prosodic feedback for non-native learners. The model incorporates both hierarchical discourse association and information structure to (1) pinpoint the prosodic features of multi-phrase continuous speech, and (2) simulate native-like expressive speech using a corpus of North American English and Taiwan L2 English. The bottom-up, additive, data-driven model aims to generate L1-like expressive continuous speech. It builds in phonetic and phonological specifications at the lexical level and syntactic/semantic specifications at the next higher phrase and sentence levels, and is completed by patterned paragraph associations and prosodic projections of information allocation at higher levels. The hierarchical model allows us to identify L1-L2 differences by prosodic module or pattern: the two novel features, “discourse structure” and “information density”, reliably capture L1-L2 prosodic differences related to phrase association and information placement. Our L1 prosodic model, which combines the proposed predictors with an optimized model trained on an L1 speech corpus, showed improved prediction over existing methods. To serve as corrective feedback for L2 learners, the predicted L1 prosodic features were compared with those of a baseline model by objective evaluation (RMS error and correlation) and then superimposed onto the L2 speech tokens. The resynthesized L2 tokens were subsequently compared with the original L2 tokens for degree of perceived accent in a subjective evaluation (a native-listener perception test). We believe the proposed model can be an effective alternative for implementing computer-assisted language learning (CALL) systems that help generate L1-like prosody from text and, at the same time, serve as corrective feedback for L2 learners.
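The objective evaluation named in the abstract (RMS error and correlation) can be illustrated with a minimal sketch. It assumes that the predicted and reference prosodic features are available as equal-length numeric sequences, for example frame-aligned F0 values; the function names `rms_error` and `pearson_correlation` and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def rms_error(predicted, reference):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_correlation(predicted, reference):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    if n != len(reference) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_r)


if __name__ == "__main__":
    # Illustrative values only (hypothetical F0 samples in Hz).
    reference_f0 = [180.0, 195.0, 210.0, 190.0, 170.0]
    predicted_f0 = [178.0, 199.0, 205.0, 193.0, 168.0]
    print("RMS error:", rms_error(predicted_f0, reference_f0))
    print("Correlation:", pearson_correlation(predicted_f0, reference_f0))
```

A lower RMS error and a correlation closer to 1 indicate that a predicted contour tracks the reference more closely, which is how two models could be compared under this kind of evaluation.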
