Digital Speech Makeup: Voice Conversion Based Altered Auditory Feedback for Transforming Self-Representation

Makeup (i.e., cosmetics) has long been used to transform not only one’s appearance but also one’s self-representation. Previous studies have demonstrated that visual transformations can induce a variety of effects on self-representation. Herein, we introduce Digital Speech Makeup (DSM), a novel concept that uses voice conversion (VC) based auditory feedback to transform human self-representation. We implemented a proof-of-concept system that combines a state-of-the-art algorithm for near real-time VC with bone-conduction headphones, which resolve the speech disruptions caused by delayed auditory feedback. Our user study confirmed that conversing for a few dozen minutes using the system influenced participants’ speech ownership and implicit bias. Furthermore, we reviewed the participants’ comments about their experience of DSM and gained additional qualitative insight into possible future directions for the concept. Our work represents a first step towards utilizing VC to design various interpersonal interactions centered on influencing users’ psychological states.
