Unsupervised and Semi-supervised Learning of Tone and Pitch Accent

Recognition of tone and intonation is essential for speech recognition and language understanding. However, most approaches to this recognition task have relied on extensive collections of manually tagged data, obtained at substantial cost in time and money. In this paper, we explore two approaches to tone learning that substantially reduce the amount of labeled training data required. We employ both unsupervised clustering and semi-supervised learning to recognize pitch accent in English and tones in Mandarin Chinese. In unsupervised Mandarin tone clustering experiments, we achieve 57-87% accuracy on materials ranging from broadcast news to clean lab speech. For English pitch accent in broadcast news materials, results reach 78%. In the semi-supervised framework, we achieve Mandarin tone recognition accuracies ranging from 70% for broadcast news speech to 94% for read speech, outperforming both Support Vector Machines (SVMs) trained on only the labeled data and the 25% baseline obtained by always assigning the most common class. These results indicate that the intrinsic structure of tone and pitch accent acoustics can be exploited to reduce the need for costly labeled training data for tone learning and recognition.
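To make the two reference points in these figures concrete, the following is a minimal sketch of how a most-common-class baseline and a clustering accuracy could be computed. It is an illustration only: the scoring rule used here (labeling each cluster with its majority class) is an assumption and is not stated in the abstract, the function names `majority_class_baseline` and `cluster_accuracy` are hypothetical, and the toy data are invented. The toy labels assume four equally frequent tone classes, which is one setting in which the most-common-class baseline comes out at 25%.

```python
from collections import Counter, defaultdict


def majority_class_baseline(labels):
    """Accuracy of always predicting the single most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def cluster_accuracy(cluster_ids, labels):
    """Score a clustering by giving each cluster its majority label.

    Assumed scoring rule (not taken from the paper): an item counts as
    correct when its true label matches the majority label of its cluster.
    """
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        by_cluster[cluster][label] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return correct / len(labels)


# Invented toy data: eight syllables, four tone classes, two of each.
tones = [1, 2, 3, 4, 1, 2, 3, 4]
# Hypothetical cluster assignments; the last syllable lands in the wrong cluster.
clusters = [0, 1, 2, 3, 0, 1, 2, 2]

baseline = majority_class_baseline(tones)
accuracy = cluster_accuracy(clusters, tones)
print(f"baseline={baseline:.3f} clustering={accuracy:.3f}")
```

Under this rule a clustering is rewarded only when items in the same cluster share a label, so it can be compared directly with the baseline without using any labels during clustering itself.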