One Voice Fits All?

Speech synthesis technology has also advanced considerably in recent years; new models based on deep neural networks such as WaveNet are now capable of generating increasingly varied and more human-sounding speech than prior approaches like concatenative or parametric synthesis [30, 63]. This explosion in the popularity and pervasiveness of voice interfaces, along with rapid improvements in speech technology, adds new urgency and complexity to the question Nass and Brave raised nearly 15 years ago.

Within the Human-Computer Interaction and Computer-Supported Cooperative Work communities, this trend has not gone unnoticed. In recent years, researchers have studied voice assistants from a number of angles. Several papers have explored users’ patterns of everyday use with common voice assistants like Alexa, Siri, and the Google Assistant [3, 44, 66, 70]. Others have considered usability challenges posed by natural language processing errors [58, 74], and future use scenarios such as leveraging speech to navigate videos [13] or promote workplace reflection [36]. There have also been efforts to establish a more theoretical or vision-setting perspective on voice technology: for example, Cohen et al. [18] and Shneiderman [72] have weighed in on the merits of voice as an interaction medium, while Murad et al. [56] proposed an initial set of design guidelines for voice interfaces. Within the CSCW community specifically, voice interactions have received considerable attention in recent years, with papers and workshops on topics ranging from accessibility [12] to automated meeting support [49], Wizard of Oz prototyping techniques [45], privacy [37], multi-user interaction [67], and more. While these papers all offer useful perspectives on voice interface design, their focus has been almost exclusively on what voice assistants say in conversation, rather than on how they say it.

This paper poses a seemingly straightforward question: What should the voices of our smart devices sound like? Specifically, as we move towards a future in which users interact through speech not just with smartphones and smart speakers, but with an increasing array of everyday objects, selecting a voice identity for these smart devices remains an open design challenge with important social consequences.

This paper introduces a research framework for understanding the social implications of design decisions in voice design. To demonstrate the utility of this framework, we both summarize existing research using it and discuss a sampling of new research questions it generates. To generate the framework, we consider the design space of smart device voices and organize the literature around what we know about how the features of a synthesized voice shape our interactions with speech-enabled technology. In doing so, we rely heavily on research in human-robot interaction (HRI), while also incorporating research from other fields such as social psychology and design research.

We are not the first to propose a framework for voice design.
For example, Clark et al. [16] mapped out the existing space of research on voice in HCI through a recent review of 68 papers. Through this review, the authors suggest a set of open challenges for the field, including a need for further design work and for studies of multi-user interaction contexts. Importantly, however, their review deliberately excluded papers focusing on embodied interfaces. Our framework complements Clark et al.’s review by focusing explicitly on this area of embodied voice design. Our HRI-based perspective also distinguishes this paper from recent work that studies the design of speech interfaces with voice in isolation. For example, Sutton et al. propose a framework based on findings from socio-phonetics [71]. While studying voices in isolation avoids confounding the effects of voice with the effects of embodiment, in practice embodiment, form factor, and contexts of use do indeed influence how people perceive voice interfaces and social robots [23, 28, 34, 50]. In our work, we hold that these attributes are not undesirable confounds, but necessary dimensions of analysis: smart devices will necessarily possess a form, contexts of use, and perhaps even human-like embodiment. Thus, the fact that embodiment, form, and voice together affect perception is precisely why we consider them together.

[1] Siddhartha S. Srinivasa et al. Gracefully mitigating breakdowns in robotic services, 2010, 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[2] A. Todorov et al. How Do You Say ‘Hello’? Personality Impressions from Brief Novel Voices, 2014, PLoS ONE.

[3] John C. Tang et al. More to Meetings: Challenges in Using Speech-Based Technology to Support Meetings, 2017, CSCW.

[4] Julia Hirschberg et al. Deep Personality Recognition for Deception Detection, 2018, INTERSPEECH.

[5] Shrikanth S. Narayanan et al. Improving Gender Identification in Movie Audio Using Cross-Domain Data, 2018, INTERSPEECH.

[6] Clifford Nass et al. Computers are social actors, 1994, CHI '94.

[7] Susan R. Fussell et al. Anthropomorphic Interactions with a Robot and Robot-like Agent, 2008.

[8] Shruti Sannon et al. "Alexa is my new BFF": Social Roles, User Satisfaction, and Personification of the Amazon Echo, 2017, CHI Extended Abstracts.

[9] Kerstin Fischer et al. Levels of embodiment: Linguistic analyses of factors influencing HRI, 2012, 2012 7th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[10] Lauren A. Schmidt et al. Sex, syntax and semantics, 2003.

[11] Roger K. Moore. Appropriate Voices for Artefacts: Some Key Insights, 2017.

[12] Maya Cakmak et al. Characterizing the Design Space of Rendered Robot Faces, 2018, 2018 13th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[13] E. Goffman. The Presentation of Self in Everyday Life, 1959.

[14] C. Nass et al. Does computer-synthesized speech manifest personality? Experimental tests of recognition, similarity-attraction, and consistency-attraction, 2001, Journal of Experimental Psychology: Applied.

[15] N. Ambady et al. Half a minute: Predicting teacher evaluations from thin slices of nonverbal behavior and physical attractiveness, 1993.

[16] Wendy Ju et al. WoZ Way: Enabling Real-time Remote Interaction Prototyping & Observation in On-road Vehicles, 2017, CSCW.

[17] Juan Manuel Montero-Martínez et al. Emotional speech synthesis: from speech database to TTS, 1998, ICSLP.

[18] Wendy Ju et al. Good vibrations: How consequential sounds affect perception of robotic arms, 2017, 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN).

[19] Guy Deutscher et al. Through the Language Glass: Why the World Looks Different in Other Languages, 2010.

[20] Abigail Sellen et al. "Like Having a Really Bad PA": The Gulf between User Expectation and Experience of Conversational Agents, 2016, CHI.

[21] B. Fogg et al. Motivating, Influencing, and Persuading Users: An Introduction To Captology, 2007.

[22] Jennifer Marlow et al. Designing for Workplace Reflection: A Chat and Voice-Based Conversational Agent, 2018, Conference on Designing Interactive Systems.

[23] Heiga Zen et al. WaveNet: A Generative Model for Raw Audio, 2016, SSW.

[24] Bilge Mutlu et al. Task Structure and User Attributes as Elements of Human-Robot Interaction Design, 2006, ROMAN 2006 - The 15th IEEE International Symposium on Robot and Human Interactive Communication.

[25] Sean Andrist et al. Effects of Culture on the Credibility of Robot Speech: A Comparison between English and Arabic, 2015, 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[26] Taezoon Park et al. When stereotypes meet robots: The double-edge sword of robot gender and personality in human-robot interaction, 2014, Comput. Hum. Behav.

[27] Roger K. Moore. Is Spoken Language All-or-Nothing? Implications for Future Speech-Based Human-Machine Interaction, 2016, IWSDS.

[28] Christoph Bartneck et al. Robots and Racism, 2018, 2018 13th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[29] Jichen Zhu et al. Patterns for How Users Overcome Obstacles in Voice User Interfaces, 2018, CHI.

[30] Mark West et al. I'd blush if I could: closing gender divides in digital skills through education, 2019.

[31] Heather Pon-Barry et al. Effects of voice-adaptation and social dialogue on perceptions of a robotic learning companion, 2016, 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[32] Benjamin R. Cowan et al. Siri, Echo and Performance: You have to Suffer Darling, 2019, CHI Extended Abstracts.

[33] Khalil Sima'an et al. Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship, 2006, Computational Linguistics.

[34] Cassia Valentini-Botinhao et al. Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations, 2015, INTERSPEECH.

[35] Astrid M. Rosenthal-von der Pütten et al. The Peculiarities of Robot Embodiment (EmCorp-Scale): Development, Validation and Initial Test of the Embodiment and Corporeality of Artificial Agents Scale, 2018, 2018 13th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[36] Walter S. Lasecki et al. Accessible Voice Interfaces, 2018, CSCW Companion.

[37] Jason C. Yip et al. Communication Breakdowns Between Families and Alexa, 2019, CHI.

[38] Daniela Karin Rosner et al. Broken probes: toward the design of worn media, 2014, Personal and Ubiquitous Computing.

[39] Leila Takayama et al. Help Me Please: Robot Politeness Strategies for Soliciting Help From Humans, 2016, CHI.

[40] Rana El Kaliouby et al. On the Future of Personal Assistants, 2016, CHI Extended Abstracts.

[41] D. Pillemer et al. Children's sex-related stereotyping of colors, 1990, Child Development.

[42] S. Shyam Sundar et al. Feminizing Robots: User Responses to Gender Cues on Robot Body and Screen, 2016, CHI Extended Abstracts.

[43] Andreea Danielescu et al. A Bot is Not a Polyglot: Designing Personalities for Multi-Lingual Conversational Agents, 2018, CHI Extended Abstracts.

[44] Tom Rodden et al. A Multimodal Approach to Assessing User Experiences with Agent Helpers, 2016, ACM Trans. Interact. Intell. Syst.

[45] Ilaria Torre et al. Can you Tell the Robot by the Voice? An Exploratory Study on the Role of Voice in the Perception of Robots, 2019, 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[46] Wendy Ju et al. Making Noise Intentional: A Study of Servo Sound Perception, 2017, 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[47] Henriette Cramer et al. "Play PRBLMS": Identifying and Correcting Less Accessible Content in Voice Interfaces, 2018, CHI.

[48] John Zimmerman et al. Re-Embodiment and Co-Embodiment: Exploration of social presence for robots and conversational agents, 2019, Conference on Designing Interactive Systems.

[49] Ben Shneiderman et al. The limits of speech recognition, 2000, CACM.

[50] Hsi-Peng Lu et al. Stereotypes or golden rules? Exploring likable voice traits of social robots as active aging companions for tech-savvy baby boomers in Taiwan, 2018, Comput. Hum. Behav.

[51] Meera M. Blattner et al. Earcons and Icons: Their Structure and Common Design Principles (Abstract only), 1989, SIGCHI Bulletin.

[52] Sarah Sharples et al. Voice Interfaces in Everyday Life, 2018, CHI.

[53] C. Judd et al. What the Voice Reveals: Within- and Between-Category Stereotyping on the Basis of Voice, 2006, Personality & Social Psychology Bulletin.

[54] Jens Edlund et al. The State of Speech in HCI: Trends, Themes and Challenges, 2018, Interact. Comput.

[55] Jodi Forlizzi et al. "Hey Alexa, What's Up?": A Mixed-Methods Studies of In-Home Conversational Agent Usage, 2018, Conference on Designing Interactive Systems.

[56] Sarah Sharples et al. "Do Animals Have Accents?": Talking with Agents in Multi-Party Conversation, 2017, CSCW.

[57] Shaun W. Lawson et al. Voice as a Design Material: Sociophonetic Inspired Design Strategies in Human-Computer Interaction, 2019, CHI.

[58] Benjamin R. Cowan et al. Design guidelines for hands-free speech interaction, 2018, MobileHCI Adjunct.

[59] Amy Ogan et al. Automated Pitch Convergence Improves Learning in a Social, Teachable Robot for Middle School Mathematics, 2018, AIED.

[60] William W. Gaver. The SonicFinder: An Interface That Uses Auditory Icons, 1989, Hum. Comput. Interact.

[61] Lone Koefoed Hansen et al. Intimate Futures: Staying with the Trouble of Digital Personal Assistants through Design Fiction, 2018, Conference on Designing Interactive Systems.

[62] Susan R. Fussell et al. How a robot should give advice, 2013, 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[63] Haizhou Li et al. Wavelet Analysis of Speaker Dependent and Independent Prosody for Voice Conversion, 2018, INTERSPEECH.

[64] K. M. Lee et al. Can robots manifest personality?: An empirical test of personality recognition, social responses, and social presence in human-robot interaction, 2006.

[65] Maneesh Agrawala et al. How to Design Voice Based Navigation for How-To Videos, 2019, CHI.

[66] Elizabeth D. Mynatt et al. An architecture for transforming graphical interfaces, 1994, UIST '94.