A uniform phase representation for the harmonic model in speech synthesis applications

Feature-based vocoders, e.g., STRAIGHT, offer a way to manipulate the perceived characteristics of the speech signal in speech transformation and synthesis. For the harmonic model, which provides excellent perceived quality, features for the amplitude parameters already exist (e.g., Line Spectral Frequencies (LSF), Mel-Frequency Cepstral Coefficients (MFCC)). However, because the phase parameters are wrapped, phase features are more difficult to design. To randomize the phase of the harmonic model during synthesis, a voicing feature is commonly used to distinguish voiced from unvoiced segments. However, voice production allows smooth transitions between voiced and unvoiced states, which can make the voicing segmentation difficult to estimate. In this article, two phase features are suggested to represent the phase of the harmonic model in a uniform way, without any voicing decision. The synthesis quality of the resulting vocoder has been evaluated, using subjective listening tests, in the context of resynthesis, pitch scaling, and Hidden Markov Model (HMM)-based synthesis. The experiments show that the suggested signal model is comparable to STRAIGHT, or even better in some scenarios. They also reveal some limitations of the harmonic framework itself in the case of high fundamental frequencies.
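The difficulty the abstract alludes to stems from phase wrapping: measured harmonic phases lie in (-π, π], so a smoothly evolving true phase appears discontinuous, and naive distances or averages on the wrapped values are misleading. A minimal NumPy sketch of the effect (illustrative only, not the paper's method):

```python
import numpy as np

# A smoothly increasing "true" phase ramp for one harmonic.
true_phase = np.linspace(0.0, 4.0 * np.pi, 9)

# Measurement yields only the wrapped principal value in (-pi, pi],
# so the ramp shows artificial jumps of ~2*pi.
wrapped = np.angle(np.exp(1j * true_phase))

# Unwrapping restores continuity by removing the 2*pi jumps;
# statistics computed on 'wrapped' directly would be meaningless.
recovered = np.unwrap(wrapped)
assert np.allclose(recovered, true_phase)
```

This is why phase parameters resist direct featurization: unlike amplitudes, they must be handled with circular statistics or a wrap-free representation such as the one the article proposes.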
