TypeTalker: A Speech Synthesis-Based Multi-Modal Commenting System

Speech commenting systems have been shown to facilitate asynchronous online communication, from educational discussion to writing feedback. However, producing speech comments poses several challenges for users, including overcoming self-consciousness and time-consuming editing. In this paper, we introduce TypeTalker, a speech commenting interface that presents speech in a synthesized generic voice to reduce speaker self-consciousness, while retaining the expressivity of the original speech, including its natural pauses and co-expressive gestures. TypeTalker streamlines speech editing through a simple textbox that preserves temporal alignment across edits. A comparative evaluation shows that TypeTalker reduces speech anxiety during live recording and offers easier and more effective speech editing than the previous state-of-the-art interface technique. A follow-up study on recipients' perceptions of the produced comments suggests that while TypeTalker's generic voice may come at the cost of a personal touch, it can also enhance the clarity of speech by smoothing out the speed and accent of the original recording.
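To illustrate the kind of mechanism the abstract alludes to, the following is a minimal sketch (not TypeTalker's actual implementation) of how textbox edits can stay temporally aligned with the underlying recording: a word-level alignment maps each transcript word to an audio span, so deleting a word in the text deletes the corresponding span. The `AlignedWord` structure, `segments_to_keep` function, and the deletion-only assumption are all hypothetical simplifications.

```python
from dataclasses import dataclass

@dataclass
class AlignedWord:
    text: str     # word as shown in the editable textbox
    start: float  # start time in the recording, seconds
    end: float    # end time in the recording, seconds

def segments_to_keep(alignment, edited_words):
    """Return the (start, end) audio spans that survive a transcript edit.

    Assumes the edit only deletes words (no insertions or reorderings),
    so we can walk both sequences in order and keep the spans whose
    words remain in the edited transcript.
    """
    kept, i = [], 0
    for word in alignment:
        if i < len(edited_words) and word.text == edited_words[i]:
            kept.append((word.start, word.end))
            i += 1
    return kept

# Hypothetical alignment for the utterance "this is um great".
alignment = [
    AlignedWord("this",  0.00, 0.21),
    AlignedWord("is",    0.21, 0.34),
    AlignedWord("um",    0.34, 0.80),  # disfluency the user deletes
    AlignedWord("great", 0.80, 1.15),
]

# The user deletes "um" in the textbox; its audio span is dropped.
print(segments_to_keep(alignment, ["this", "is", "great"]))
# -> [(0.0, 0.21), (0.21, 0.34), (0.8, 1.15)]
```

The kept spans can then be concatenated (or passed to a synthesizer) to produce the edited audio; a real system would also need to handle insertions, word-level re-synthesis, and ambiguity between repeated words.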
