Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation

Current-generation automatic speech recognition (ASR) systems model spoken discourse as a quasi-linear sequence of words and phones. Because it is unusual for every phone within a word to be pronounced in a standard ("canonical") way, ASR systems often depend on a multi-pronunciation lexicon to match an acoustic sequence with a lexical unit. Since there are, in practice, many different ways for a word to be pronounced, this standard approach adds a layer of complexity and ambiguity to the decoding process; simplifying it could improve recognition performance. Analysis of pronunciation variation in a corpus of spontaneous English discourse (Switchboard) demonstrates that the observed variation is more systematic at the level of the syllable than at the level of the phonetic segment. In particular, syllabic onsets are realized in canonical form far more frequently than either nuclei or codas. Prosodic prominence and lexical stress also appear to play an important role in pronunciation variation. The governing mechanism is likely to involve the informational valence associated with syllabic and lexical elements, and for this reason pronunciation variation offers a potential window onto the mechanisms responsible for the production and understanding of spoken language.
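
The comparison of canonical realization across syllable positions can be illustrated with a minimal sketch. It assumes that each transcribed phone has already been aligned with its canonical (dictionary) form and labeled as onset, nucleus, or coda. The function name `canonical_rates`, the tuple layout, and the toy observations are hypothetical and chosen only for illustration; they are not data or results from the Switchboard analysis.

```python
from collections import defaultdict

# Hypothetical toy alignments, NOT Switchboard data:
# (syllable position, canonical phone, realized phone or None if deleted)
OBSERVATIONS = [
    ("onset", "k", "k"),
    ("onset", "dh", "dh"),
    ("onset", "n", "n"),
    ("nucleus", "ae", "ae"),
    ("nucleus", "ae", "ax"),   # vowel reduced
    ("nucleus", "ow", "ow"),
    ("nucleus", "ih", "ax"),   # vowel reduced
    ("coda", "t", None),       # segment deleted
    ("coda", "d", None),       # segment deleted
    ("coda", "t", "dx"),       # segment substituted
    ("coda", "n", "n"),
]


def canonical_rates(observations):
    """Return the fraction of canonically realized phones per syllable position."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for position, canonical, realized in observations:
        totals[position] += 1
        # A deletion (None) or a substitution counts as non-canonical.
        if realized == canonical:
            matches[position] += 1
    return {position: matches[position] / totals[position] for position in totals}


if __name__ == "__main__":
    for position, rate in canonical_rates(OBSERVATIONS).items():
        print(f"{position}: {rate:.2f}")
```

Grouping by syllable position rather than by phone identity is the design choice that mirrors the syllable-centric perspective: the same phone can fall into different groups depending on where it occurs in the syllable.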
