Synthesizing a human-like voice is the easy way

Deep-learning-based technologies produce speech that is almost indistinguishable from human speech. However, focusing on producing human-like voices raises ethical, security, and societal issues. Given the flexibility and regression power of these technologies, it is now time to consider a new type of synthesis: natural non-human-like speech synthesis. This paper argues that such research opens new directions, that it offers another perspective on the challenges of human-like speech, and that enough material is available to begin investigating non-human-like speech.
