Using automatic stress extraction from audio for improved prosody modelling in speech synthesis

Generating proper and natural-sounding prosody is a key concern of current speech synthesis research. An important factor in this effort is the availability of a precisely labelled speech corpus with adequate prosodic stress marking. Producing such labelling manually requires a huge effort, and inter-annotator agreement is usually far below 100%. Stress marking derived from the phonetic transcription is an alternative, but yields even poorer quality than human annotation. Automatic labelling may help to overcome these difficulties. This paper presents an automatic approach to stress detection based purely on audio, which is used to derive an automatic, layered labelling of stress events and to link them to syllables. As a proof of concept, a speech corpus was extended with the output of the stress detection algorithm, and an HMM-TTS system was trained on the extended corpus. Results are compared to a baseline system trained on the same database, but with stress marking obtained from textual transcripts by applying a set of linguistic rules. The evaluation includes CMOS tests and an analysis of the decision trees. Results show an overall improvement in the prosodic properties of the synthesized speech, and subjective ratings indicate that the voice is perceived as more natural.
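The step of linking detected stress events to syllables can be illustrated with a minimal sketch. It is not the paper's algorithm: the function name `link_stress_to_syllables`, the tuple layouts, and the rule that a syllable takes the highest stress layer of any event falling inside its time interval are all assumptions made for illustration.

```python
from bisect import bisect_right


def link_stress_to_syllables(syllables, events):
    """Attach stress events to the syllables that contain them.

    syllables: list of (start_sec, end_sec, label), sorted by start time and
               non-overlapping (hypothetical layout, e.g. from forced alignment).
    events:    list of (time_sec, level), where level is an integer stress
               layer and a higher value means stronger stress (assumption).
    Returns a list of (label, level); level is 0 for unstressed syllables.
    """
    starts = [start for start, _, _ in syllables]
    levels = [0] * len(syllables)
    for time, level in events:
        # Index of the last syllable that starts at or before the event time.
        i = bisect_right(starts, time) - 1
        # Ignore events before the first syllable or inside a pause.
        if i >= 0 and time < syllables[i][1]:
            levels[i] = max(levels[i], level)
    return [(label, lvl) for (_, _, label), lvl in zip(syllables, levels)]
```

The resulting per-syllable stress levels could then be written into the context labels used for training, in place of stress marks derived from the text.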
