Capture-Time Feedback for Recording Scripted Narration

Well-performed audio narrations are a hallmark of captivating podcasts, explainer videos, radio stories, and movie trailers. To record these narrations, professional voiceover actors follow guidelines that describe how to use low-level vocal components (volume, pitch, timbre, and tempo) to deliver performances that emphasize important words while maintaining variety, flow, and diction. Yet these techniques are not well known outside the professional voiceover community, especially among hobbyist producers looking to create their own narrations. We present Narration Coach, an interface that assists novice users in recording scripted narrations. As a user records her narration, our system synchronizes the takes to her script, provides text feedback about how well she is meeting the expert voiceover guidelines, and resynthesizes her recordings to help her hear how she can speak better.
