VoiceAssist: Guiding Users to High-Quality Voice Recordings

Voice recording is a challenging task with many pitfalls, including sub-par recording environments, mistakes in the recording setup, and poor microphone quality. Newcomers often struggle as a result, producing recordings with low sound quality. Poor amateur recordings commonly suffer from two key problems: too much reverberation (echo) and too much background noise (e.g., fans, electronics, street noise). We present VoiceAssist, a system that helps inexperienced users produce high-quality recordings by providing real-time visual feedback on audio quality. We integrate modern audio quality measures into an interactive human-machine feedback loop so that audio quality can be maximized at capture time. A user study demonstrates the utility of this feedback: when presented with visual feedback about recording quality, users produced recordings that third-party listeners strongly preferred over recordings made without it.
