Automatic pitch accent detection using auto-context with acoustic features

In prosodic event detection, many local acoustic features have been proposed to represent the prosodic characteristics of a speech unit. Context information, which captures regularities among neighboring prosodic events, has not yet been used effectively, mainly because long-distance sequential dependencies are hard to capture. To address this problem, we introduce a new learning approach, auto-context. In this algorithm, a classifier is first trained on local acoustic features, and the discriminative probabilities it produces are selected as context information for the next iteration. A new classifier is then trained on the selected context information together with the local acoustic features. By repeatedly feeding the updated probabilities back as context information, the algorithm improves its recognition ability from iteration to iteration until it converges. The merit of this method is that it selects context information flexibly, retaining reliable context and discarding unreliable context. Experimental results show that the proposed method improves pitch accent detection accuracy by about 1% absolute.
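
The iterative procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It makes three assumptions that do not come from the abstract: a small logistic regression stands in for the base classifier, the context is "selected" simply by reading the previous iteration's probabilities at a fixed set of neighboring positions (`offsets`), and convergence is declared when the probabilities change by less than `tol`. All function names, parameters, and default values are hypothetical.

```python
import math

def predict_proba(w, x):
    """Accent probability for one unit; w holds feature weights with the bias last."""
    z = w[-1] + sum(wj * v for wj, v in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=200, lr=0.5):
    """Stand-in classifier (assumption): logistic regression, batch gradient descent."""
    n_feat = len(X[0])
    w = [0.0] * (n_feat + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_feat + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, v in enumerate(xi):
                grad[j] += err * v
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def context_features(probs, i, offsets):
    """Previous-iteration probabilities of the neighbors of unit i.
    Positions outside the sequence get the uninformative value 0.5."""
    return [probs[i + d] if 0 <= i + d < len(probs) else 0.5 for d in offsets]

def augment(X, probs, offsets):
    """Local features alone (first iteration) or local plus context features."""
    if probs is None:
        return [list(x) for x in X]
    return [list(x) + context_features(probs, i, offsets) for i, x in enumerate(X)]

def auto_context(X, y, offsets=(-2, -1, 1, 2), max_iters=5, tol=1e-3):
    """Train a cascade of classifiers on one sequence of units.
    X: local acoustic feature vectors in temporal order; y: 0/1 accent labels."""
    models, probs = [], None
    for _ in range(max_iters):
        feats = augment(X, probs, offsets)
        w = train_logreg(feats, y)
        new_probs = [predict_proba(w, f) for f in feats]
        models.append(w)
        # Hypothetical stopping rule: probabilities no longer change.
        if probs is not None and max(abs(a - b) for a, b in zip(new_probs, probs)) < tol:
            break
        probs = new_probs
    return models

def predict(models, X, offsets):
    """Apply the cascade in training order, feeding each stage's output to the next."""
    probs = None
    for w in models:
        feats = augment(X, probs, offsets)
        probs = [predict_proba(w, f) for f in feats]
    return probs
```

At prediction time the classifiers are applied in the order they were trained, so each stage sees the same kind of context it was trained on. Treating the whole data set as a single sequence is a simplification; with several utterances, the context features would be built per utterance so that probabilities do not leak across utterance boundaries.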
