Can we Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech?—A Dataset, Insights, and Challenges

The goal of speech enhancement is typically to recover clean speech from noisy, reverberant, and often bandlimited recordings in order to improve intelligibility, clarity, or automatic speech recognition performance. However, for a great deal of speech content, such as voice-overs, podcasts, demo videos, lecture videos, and audio stories, the acoustic goal is often not merely clean speech but speech that is aesthetically pleasing. This is achieved in professional recording studios by having a skilled sound engineer record clean speech in an acoustically treated room and then edit and process it with audio effects (which we refer to as production). A growing amount of speech content is being recorded on common consumer devices such as tablets, smartphones, and laptops, and it is typically recorded in common but non-acoustically treated environments such as homes and offices. We argue that the goal of enhancing such recordings should not only be to make them sound cleaner, as would be done using traditional speech enhancement techniques, but to make them sound as if they were recorded and produced in a professional recording studio. In this paper, we show why this can be beneficial, describe a new dataset (a great deal of which was recorded in a professional recording studio) that we prepared to help in developing algorithms for this purpose, and discuss some insights and challenges associated with this problem.
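To make the distinction between enhancement and production more concrete, the sketch below outlines a toy production-style chain (high-pass equalization, simple dynamic range compression, and peak normalization) that would be applied after a separate enhancement stage. The function names, parameter values, and processing order here are illustrative assumptions for exposition only, not the processing pipeline described in the paper.

```python
# A minimal, assumed sketch of a "production"-style chain applied to speech
# that has already been denoised/dereverberated by an enhancement stage.
import numpy as np
from scipy.signal import butter, lfilter

def high_pass(x, sr, cutoff_hz=80.0):
    """Remove low-frequency rumble, a common first step in speech production."""
    b, a = butter(2, cutoff_hz / (sr / 2.0), btype="high")
    return lfilter(b, a, x)

def compress(x, threshold_db=-20.0, ratio=4.0):
    """Crude per-sample dynamic range compression (no attack/release smoothing)."""
    level_db = 20.0 * np.log10(np.abs(x) + 1e-9)          # instantaneous level
    over_db = np.maximum(level_db - threshold_db, 0.0)     # amount above threshold
    gain_db = -over_db * (1.0 - 1.0 / ratio)               # gain reduction
    return x * 10.0 ** (gain_db / 20.0)

def produce(x, sr):
    """Illustrative production chain: EQ -> compression -> peak normalization."""
    y = high_pass(x, sr)
    y = compress(y)
    return 0.95 * y / (np.max(np.abs(y)) + 1e-9)
```

A real production chain, as applied by a sound engineer, would of course involve far more careful equalization, level-dependent compression with attack and release behavior, de-essing, and mastering-style limiting; the point of the sketch is only that these steps go beyond what traditional enhancement (denoising, dereverberation, bandwidth extension) aims to do.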
