Unsupervised Learning of Tone and Pitch Accent

Recognition of tone and intonation is essential for speech recognition and language understanding. However, most approaches to this recognition task have relied upon extensive collections of manually tagged data obtained at substantial time and financial cost. In this paper, we explore unsupervised clustering approaches to recognize pitch accent in English and tones in Mandarin Chinese. In unsupervised Mandarin tone clustering experiments, we achieve 57-87% accuracy on materials ranging from broadcast news to clean lab speech. For English pitch accent in broadcast news materials, results reach 78%. These results indicate that the intrinsic structure of tone and pitch accent acoustics can be exploited to reduce the need for costly labeled training data for tone learning and recognition.
