Behavioral Changes in Speakers who are Automatically Captioned in Meetings with Deaf or Hard-of-Hearing Peers

Deaf and hard of hearing (DHH) individuals face communication barriers in small-group meetings with hearing peers; we examine the use of automatic speech recognition (ASR) to generate captions on mobile devices in this setting. Although ASR output contains errors, we study whether such tools benefit users and influence conversational behaviors. We conducted an experiment in which DHH and hearing individuals collaborated in discussions under three conditions: without an ASR-based captioning application, with the application, and with a version of the application that marks words for which the ASR has low confidence. An analysis of audio recordings from each participant across conditions revealed significant differences in speech features. When using the ASR-based automatic captioning application, hearing individuals spoke more loudly, with improved voice quality (higher harmonics-to-noise ratio), with non-standard articulation (shifts in the F1 and F2 formants), and at a faster rate. Identifying non-standard speech in this setting has implications for the composition of data used for ASR training and testing, which should be representative of the system's usage context. Understanding these behavioral influences may also enable designers of ASR captioning systems to leverage these effects to promote communication success.
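
To make the speech-feature analysis concrete, the sketch below shows how the four reported measures (intensity, harmonics-to-noise ratio, F1/F2 formant frequencies, and speaking rate) could be extracted from a participant's audio recording. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes one mono WAV file per participant and uses parselmouth, a Python interface to the Praat phonetics software; the extract_features helper, the file names, and the word counts are hypothetical.

# Minimal sketch (not the authors' pipeline) of extracting the speech
# features discussed above, using parselmouth, a Python interface to Praat.
# Assumptions: one mono WAV recording per participant; a word count from a
# transcript is available for the speaking-rate estimate.
import parselmouth
from parselmouth.praat import call

def extract_features(wav_path, n_words=None):
    snd = parselmouth.Sound(wav_path)

    # Loudness: mean intensity (dB) over the whole recording.
    intensity = snd.to_intensity()
    mean_db = call(intensity, "Get mean", 0, 0, "energy")

    # Voice quality: mean harmonics-to-noise ratio (HNR, dB).
    harmonicity = snd.to_harmonicity()
    mean_hnr = call(harmonicity, "Get mean", 0, 0)

    # Articulation: mean first and second formant frequencies (Hz).
    formants = snd.to_formant_burg()
    mean_f1 = call(formants, "Get mean", 1, 0, 0, "hertz")
    mean_f2 = call(formants, "Get mean", 2, 0, 0, "hertz")

    # Speaking rate: a coarse words-per-second estimate, if a transcript
    # word count is supplied (syllable-based measures would be finer-grained).
    rate = n_words / snd.get_total_duration() if n_words else float("nan")

    return {"intensity_db": mean_db, "hnr_db": mean_hnr,
            "f1_hz": mean_f1, "f2_hz": mean_f2, "rate_wps": rate}

# Example use: compare one speaker's features across the three conditions
# (hypothetical file names and word counts).
for cond in ("no_asr", "asr", "asr_confidence"):
    print(cond, extract_features("p01_" + cond + ".wav", n_words=250))

Per-condition feature values computed this way could then be compared with a repeated-measures test to detect the kinds of within-speaker differences the study reports.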
