Testing the GlórCáil System in a Speaker and Affect Voice Transformation Task

This paper describes the GlórCáil speech synthesis system and reports the results of a voice transformation experiment conducted as part of its evaluation. Participants manipulated the system’s control parameters, which reflect changes in voice quality, f0 and the speaker’s vocal tract length (VT), in synthetic utterances. Starting from a synthetic baseline utterance, they adjusted these parameters to make it sound like a target speaker (man, woman, child) with a given affective colouring (sad, angry, no emotion). The control parameters proved useful in modulating speaker characteristics and paralinguistic prosody, and the manipulations performed by the participants were mainly in the expected direction. f0 and VT were found to be significant predictors of speaker gender/age, but not of affect. The voice quality related parameter Rd was a significant predictor of affect, but not of speaker identity. Significant interactions of predictors were found for f0 and VT. The control parameter values obtained in this experiment will be used to generate stimuli for testing the proposed system once it is integrated into a DNN-based speech synthesis system, as part of the ongoing work of the ABAIR project.
