A Contextual Study of Semantic Speech Editing in Radio Production

Abstract

Radio production involves editing speech-based audio using tools that represent sound as simple waveforms. Semantic speech editing systems instead allow users to edit audio through an automatically generated transcript, which has the potential to improve the production workflow. To investigate this, we developed a semantic audio editor based on a pilot study. Through a contextual qualitative study of five professional radio producers at the BBC, we examined the existing radio production process and evaluated our semantic editor by using it to create programmes that were later broadcast. We observed that participants wrote detailed notes about their recordings and used annotation to mark which parts they wanted to use. They collaborated closely with the presenter of their programme to structure its contents and write narrative elements. Participants reported that they often work away from the office to avoid distractions, and that they print transcripts so they can work away from screens. They also emphasised that listening is an important part of production, as it ensures high sound quality. We found that semantic speech editing with automatic speech recognition can improve the radio production workflow, but that annotation, collaboration, portability and listening are not well supported by current semantic speech editing systems. We conclude with recommendations on how future semantic speech editing systems can better support the requirements of radio production.
