DNN-SPACE: DNN-HMM-Based Generative Model of Voice F0 Contours for Statistical Phrase/Accent Command Estimation

This paper proposes a method to extract prosodic features from a speech signal by leveraging auxiliary linguistic information. A prosodic feature extractor called statistical phrase/accent command estimation (SPACE) has recently been proposed. This extractor is based on a statistical model formulated as a stochastic counterpart of the Fujisaki model, a well-founded mathematical model representing the control mechanism of vocal fold vibration. The key idea of this approach is that a phrase/accent command pair sequence is modeled as an output sequence of a path-restricted hidden Markov model (HMM), so that estimating the state transitions amounts to estimating the phrase/accent commands. Since the phrase and accent commands are related to linguistic information, we may expect to improve the command estimation accuracy by using that information as an auxiliary input for inference. To model the relationship between the phrase/accent commands and linguistic information, we construct a deep neural network (DNN) that maps linguistic feature vectors to the state posterior probabilities of the HMM. Thus, given a pitch contour and linguistic information, we can estimate the phrase/accent commands via state decoding. We call this method “DNN-SPACE.” Experimental results revealed that using linguistic information effectively improved the command estimation accuracy.
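The decoding step described above, combining DNN state posteriors with a path-restricted HMM, follows the standard hybrid DNN-HMM recipe: the network's posteriors p(s|x_t) are converted to pseudo-likelihoods by dividing out the state priors, and the most likely state path is then found by Viterbi decoding, with path restrictions encoded as forbidden (zero-probability) transitions. The sketch below is a minimal illustration of that generic recipe; the transition structure, state inventory, and function names are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def pseudo_loglik(posteriors, state_priors, eps=1e-12):
    """Hybrid DNN-HMM scoring: turn DNN posteriors p(s|x_t) into
    pseudo-likelihoods p(x_t|s) ∝ p(s|x_t) / p(s), in the log domain."""
    return np.log(posteriors + eps) - np.log(state_priors + eps)

def viterbi(log_trans, log_obs):
    """Most likely state path given a log transition matrix (S x S) and
    per-frame log observation scores (T x S). Disallowed transitions
    (the path restriction) are encoded as -inf entries in log_trans."""
    T, S = log_obs.shape
    delta = np.full((T, S), -np.inf)   # best path score ending in each state
    psi = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = log_obs[0]              # uniform initial-state prior (assumption)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans       # (prev, cur)
        psi[t] = np.argmax(scores, axis=0)               # best predecessor
        delta[t] = scores[psi[t], np.arange(S)] + log_obs[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):     # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

For example, with a three-state left-to-right topology (each state may only hold or advance), feeding in DNN posteriors that favor states 0, 1, 1, 2 over four frames recovers exactly that monotone path; a transition matrix with more -inf entries restricts the decoder further, which is how the command-sequence constraints would be imposed.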
