Analysis and automatic recognition of tones in Mandarin Chinese

In tonal languages such as Mandarin Chinese, words are defined by their phonemic sequence and by the intonational patterns (tones) of their syllables. To see whether the problem of tone recognition is worth solving, we propose an information-theoretic measure to compare the relative importance (Functional Load) of phonological contrasts in any language. Empirical calculations show that tones are at least as important as vowels for conveying information in Mandarin. We then carry out a large and thorough investigation of possible acoustic features for recognizing tones. This involves hundreds of experiments, each of which classifies over a hundred thousand syllables from ten hours of broadcast news speech. We first determine a base set of features (based on pitch, duration, and overall intensity) that achieves a syllable classification accuracy of 58.9%. Experiments on a subset of our data show that simple features based on energy in various frequency bands work better for tone recognition than those based on more complicated methods like harmonic-amplitude differences and glottal flow estimation. Further experiments determine a set of band energy features that improve classification accuracy to 63.7%, with the F score for Neutral Tone increasing from 0.345 to 0.619. This opens up a host of new features for future speech researchers in industry and academia to investigate and use. We also investigate making additional use of context: even if we know the tones of the surrounding syllables, we can increase classification accuracy only to 67.2%. (This provides a useful upper bound for our experiments.) While we do not have such ideal contextual information, we can use estimates of it to increase accuracy to 65.0%. Finally, we investigate the hypothesis that better articulated syllables are easier to recognize. On a small corpus of lab speech from Xu (1999), we classify syllables in focussed words with over 99% accuracy, and use this to improve classification accuracy of all syllables. However, in broadcast news speech, we find that while stronger syllables are recognized better, the difference is not enough to suggest an algorithm that makes use of it.
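
The abstract does not spell out the Functional Load measure, so the following is only a minimal sketch of one common information-theoretic formulation: the relative drop in entropy when a contrast is merged, FL = (H(L) - H(L')) / H(L). The unigram word model, the toy corpus, and the helper names (`entropy`, `strip_tone`, `merge_vowels`, `functional_load`) are all assumptions made for illustration and are not taken from the text.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of the distribution implied by frequency counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

def strip_tone(word):
    """Merge all tonal contrasts by dropping the trailing tone digit of each syllable."""
    return tuple(syl.rstrip("012345") for syl in word)

def merge_vowels(word):
    """Merge all vowel contrasts by mapping every vowel letter to a single symbol."""
    table = str.maketrans("aeiou", "VVVVV")
    return tuple(syl.translate(table) for syl in word)

def functional_load(word_counts, merge):
    """Relative entropy loss when `merge` collapses a contrast (assumed formulation)."""
    merged = Counter()
    for word, count in word_counts.items():
        merged[merge(word)] += count
    h = entropy(word_counts)
    return (h - entropy(merged)) / h

# Hypothetical toy corpus: words are tuples of pinyin syllables with tone digits.
toy_counts = Counter({("ma1",): 4, ("ma3",): 4, ("ma4",): 4, ("mi3",): 4})

fl_tone = functional_load(toy_counts, strip_tone)     # about 0.594
fl_vowel = functional_load(toy_counts, merge_vowels)  # 0.25
print(f"FL(tones)  = {fl_tone:.3f}")
print(f"FL(vowels) = {fl_vowel:.3f}")
```

In this toy corpus, merging tones loses more information than merging vowels, which is the kind of comparison the measure is meant to support. Real estimates would need a large corpus and, plausibly, a higher-order language model.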

[1]  Hsuan-Tien Lin,et al.  A note on Platt’s probabilistic outputs for support vector machines , 2007, Machine Learning.

[2]  P. Niyogi,et al.  Quantifying the functional load of phonemic oppositions, distinctive features, and suprasegmentals , 2006 .

[3]  Patricia A. Keating,et al.  Linguistic Voice Quality , 2006 .

[4]  Gina-Anne Levow,et al.  Additional Cues for Mandarin Tone Recognition , 2006 .

[5]  Mari Ostendorf,et al.  Modeling lexical tones for mandarin large vocabulary continuous speech recognition , 2006 .

[6]  Gina-Anne Levow,et al.  Tone recognition in Mandarin using focus , 2005, INTERSPEECH.

[7]  S. Sathiya Keerthi,et al.  A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs , 2005, J. Mach. Learn. Res.

[8]  Jan P. H. van Santen,et al.  Duration and spectral balance of intervocalic consonants: A case for efficient communication , 2005, Speech Commun.

[9]  B. Rosner,et al.  Loudness predicts prominence: fundamental frequency lends little. , 2005, The Journal of the Acoustical Society of America.

[10]  Mark Hasegawa-Johnson,et al.  Acoustic correlates of non‐modal phonation in telephone speech , 2005 .

[11]  Paavo Alku,et al.  A toolkit for voice inverse filtering and parametrisation , 2005, INTERSPEECH.

[12]  Britta Lintfert,et al.  Voice quality dimensions of pitch accents , 2005, INTERSPEECH.

[13]  Hannu Pulakka Analysis of human voice production using inverse filtering, high-speed imaging, and electroglottography , 2005 .

[14]  Mei-Yuh Hwang,et al.  Incorporating tone-related MLP posteriors in the feature representation for Mandarin ASR , 2005, INTERSPEECH.

[15]  Gina-Anne Levow,et al.  Context in multi-lingual tone and pitch accent recognition , 2005, INTERSPEECH.

[16]  Yi Xu,et al.  On the Temporal Domain of Focus , 2004 .

[17]  M. Grenié,et al.  The Creaky Voice Phonation And The Organisation Of Chinese Discourse , 2004 .

[18]  Gina-Anne Levow,et al.  The functional load of tone in Mandarin is as high as that of vowels , 2004, Speech Prosody 2004.

[19]  Chih-Jen Lin,et al.  Probability Estimates for Multi-class Classification by Pairwise Coupling , 2003, J. Mach. Learn. Res.

[20]  Chilin Shih,et al.  Quantitative measurement of prosodic strength in Mandarin , 2003, Speech Commun.

[21]  J. Hirsch,et al.  fMRI Evidence for Cortical Modification during Learning of Mandarin Lexical Tone , 2003, Journal of Cognitive Neuroscience.

[22]  Fabio Tamburini,et al.  Automatic prosodic prominence detection in speech using acoustic features: an unsupervised system , 2003, INTERSPEECH.

[23]  Ailbhe Ní Chasaide,et al.  The role of voice quality in communicating emotion, mood and attitude , 2003, Speech Commun.

[24]  Chilin Shih,et al.  Prosody modeling with soft templates , 2003, Speech Commun.

[25]  Paul Boersma,et al.  Praat: doing phonetics by computer , 2003 .

[26]  Xiaochuan Niu,et al.  Prediction and synthesis of prosodic effects on spectral balance of vowels , 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis.

[27]  P. Alku,et al.  Normalized amplitude quotient for parametrization of the glottal flow. , 2002, The Journal of the Acoustical Society of America.

[28]  Anders Eriksson,et al.  Syllable prominence: a matter of vocal effort, phonetic distinctness and top-down processing , 2001, INTERSPEECH.

[29]  Tsan Huang,et al.  The Interplay of Perception and Phonology in Tone 3 Sandhi in Chinese Putonghua , 2001 .

[30]  Chilin Shih,et al.  Stem-ML: language-independent prosody description , 2000, INTERSPEECH.

[31]  Stephanie Seneff,et al.  Improved tone recognition by normalizing for coarticulation and intonation effects , 2000, INTERSPEECH.

[32]  Charles L. Wayne Multilingual Topic Detection and Tracking: Successful Research Enabled by Corpora and Evaluation , 2000, LREC.

[33]  C. Shih,et al.  A Declination Model of Mandarin Chinese , 2000 .

[34]  Stanley F. Chen,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[35]  Yi Xu,et al.  Effects of tone and focus on the formation and alignment of f0 contours , 1999 .

[36]  John Platt,et al.  Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods , 1999 .

[37]  Nick Campbell,et al.  Accent, stress, and spectral tilt , 1997 .

[38]  Yi Xu Contextual tonal variations in Mandarin , 1997 .

[39]  Yi Xu,et al.  What can tone studies tell us about intonation , 1997 .

[40]  Agaath M. C. Sluijter,et al.  Spectral balance as an acoustic correlate of linguistic stress. , 1996, The Journal of the Acoustical Society of America.

[41]  J Kreiman,et al.  The perceptual structure of pathologic voice quality. , 1996, The Journal of the Acoustical Society of America.

[42]  Rongrong Liao Pitch contour formation in Mandarin Chinese : a study of tone and intonation , 1994 .

[43]  Donald B. Percival,et al.  Spectral Analysis for Physical Applications , 1993 .

[44]  Yi Xu Perception of coarticulated tones. , 1991 .

[45]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[46]  Keikichi Hirose,et al.  Analysis and modeling of tonal features in polysyllabic words and sentences of the standard Chinese , 1990, ICSLP.

[47]  J. Perkell,et al.  Glottal airflow and transglottal air pressure measurements for male and female speakers in soft, normal, and loud voice. , 1988, The Journal of the Acoustical Society of America.

[48]  C. Gobl Voice source dynamics in connected speech , 1988 .

[49]  David Carter,et al.  An information-theoretic analysis of phonetic dictionary access , 1987 .

[50]  Chilin Shih,et al.  The prosodic domain of tone sandhi in Chinese , 1986 .

[51]  Rud S. Meyerstein Functional Load: Descriptive Limitations Alternatives of Assessment and Extensions of Application , 1970 .

[52]  趙 元任,et al.  A grammar of spoken Chinese = 中國話的文法 , 1968 .

[53]  William S.-Y. Wang The Measurement of Functional Load , 1967 .

[54]  P. Ladefoged Preliminaries to linguistic phonetics , 1967 .

[55]  Claude E. Shannon,et al.  Prediction and Entropy of Printed English , 1951 .