Improving Automatic Classification of Prosodic Events by Pairwise Coupling

This paper presents a system that automatically labels Tones and Break Indices (ToBI) events. The detection of prosodic events (a binary decision on whether an event is present) has received significantly more attention from researchers than their classification into types, because classification is intrinsically harder. We focus on the classification problem, identifying eight types of pitch accent tones, nine types of boundary tones and five types of break indices. The complex multi-class problem is divided into several simpler two-class problems by means of pairwise coupling: two-class classifiers are trained and their outputs combined to obtain the multi-class decision, because two-class problems can be solved with high accuracy. Furthermore, the complementarity between artificial neural network and decision tree classifiers is exploited to improve the final system by combining their outputs with a fusion method. This proposal, together with a feature extraction stage that includes features such as the Tilt and Bézier parameters, achieves a total classification accuracy of 70.8% for pitch accents, 84.2% for boundary tones and 74.6% for break indices on the Boston University Radio News Corpus. An analysis of the misclassified samples shows that the types of mistakes the system makes do not differ significantly from the confusions commonly observed in manual ToBI inter-transcriber tests.
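To make the decomposition concrete, the following is a minimal sketch of one way pairwise two-class scores could be fused and coupled into a multi-class decision. It is not the authors' implementation: the abstract does not specify the fusion method or the coupling rule, so the weighted-mean fusion, the simple averaging rule for coupling, the function names (`n_pairwise`, `fuse`, `couple`) and all scores below are assumptions chosen purely for illustration.

```python
from itertools import combinations


def n_pairwise(k):
    """Number of two-class problems needed for k classes: k(k-1)/2."""
    return k * (k - 1) // 2


def fuse(score_a, score_b, weight=0.5):
    """Assumed fusion rule: weighted mean of two classifiers' scores."""
    return weight * score_a + (1.0 - weight) * score_b


def couple(pairwise, classes):
    """Combine pairwise estimates into one score per class.

    pairwise[(a, b)] is the estimated probability of class a in the
    a-versus-b problem, so the estimate for b is 1 - pairwise[(a, b)].
    Assumed coupling rule: average each class's pairwise estimates.
    """
    totals = {c: 0.0 for c in classes}
    for a, b in combinations(classes, 2):
        r_ab = pairwise[(a, b)]
        totals[a] += r_ab
        totals[b] += 1.0 - r_ab
    norm = n_pairwise(len(classes))  # makes the scores sum to 1
    return {c: t / norm for c, t in totals.items()}


# Hypothetical scores for three pitch accent types (illustrative only).
classes = ["H*", "L*", "L+H*"]
ann = {("H*", "L*"): 0.8, ("H*", "L+H*"): 0.6, ("L*", "L+H*"): 0.3}
tree = {("H*", "L*"): 0.6, ("H*", "L+H*"): 0.4, ("L*", "L+H*"): 0.5}

fused = {pair: fuse(ann[pair], tree[pair]) for pair in ann}
posteriors = couple(fused, classes)
predicted = max(posteriors, key=posteriors.get)
```

Under this decomposition, the eight pitch accent types would require 28 two-class classifiers, the nine boundary tone types 36, and the five break indices 10. Each classifier only has to separate two labels, which is the property the abstract relies on when it argues that two-class problems are easier to solve accurately.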
