The role of glottal source parameters for high-quality transformation of perceptual age

Intuitive control of voice transformation (e.g., age/sex, emotions) is useful for extending the expressive repertoire of a voice. This paper explores the role of glottal source parameters in the control of voice transformation. First, the SVLN speech synthesizer (Separation of the Vocal-tract with the Liljencrants-Fant model plus Noise) is used to represent the glottal source parameters (and thus voice quality) during speech analysis and synthesis. Then, a simple statistical method is presented to control speech parameters during voice transformation: a GMM is used to model the speech parameters of a voice, and regressions are then used to adapt the GMM's statistics (mean and variance) to a control parameter (e.g., age/sex, emotions). A subjective experiment on the control of perceptual age demonstrates the importance of the glottal source parameters for the control of voice transformation, and shows that the statistical model controls voice parameters effectively while preserving high quality in the transformed voice.
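The regression step described above can be illustrated with a minimal sketch. It assumes that per-speaker GMM statistics have already been estimated and that each statistic varies linearly with the control parameter; the abstract does not specify the regression form, so the linear model, the function names (`fit_stat_regression`, `predict_stat`, `adapt_frame`), and the array shapes are all hypothetical, not the authors' implementation.

```python
import numpy as np

def fit_stat_regression(control, stats):
    """Least-squares fit of stats[s] ~ intercept + slope * control[s].

    control: (S,) control values, one per speaker (e.g. age). Hypothetical input.
    stats:   (S, K, D) per-speaker GMM statistics, e.g. the means of K
             components over D speech parameters. Hypothetical layout.
    Returns (intercept, slope), each of shape (K, D).
    """
    control = np.asarray(control, dtype=float)
    stats = np.asarray(stats, dtype=float)
    n_speakers = stats.shape[0]
    design = np.column_stack([np.ones(n_speakers), control])  # (S, 2)
    flat = stats.reshape(n_speakers, -1)                      # (S, K*D)
    coef, *_ = np.linalg.lstsq(design, flat, rcond=None)      # (2, K*D)
    intercept = coef[0].reshape(stats.shape[1:])
    slope = coef[1].reshape(stats.shape[1:])
    return intercept, slope

def predict_stat(intercept, slope, value):
    """Evaluate the fitted regression at a target control value."""
    return intercept + slope * value

def adapt_frame(x, src_mean, src_var, tgt_mean, tgt_var):
    """Map one frame from source to target statistics of a single component.

    Assumption: a per-dimension shift and scale (diagonal covariance).
    """
    return tgt_mean + np.sqrt(tgt_var / src_var) * (x - src_mean)
```

In this sketch the same fit can be applied to means and to variances; fitting log-variances instead would keep predicted variances positive, which is a design choice of the sketch rather than something stated in the source.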
