Predicting Preferred Dialogue-to-Background Loudness Difference in Dialogue-Separated Audio

Dialogue Enhancement (DE) enables the rebalancing of dialogue and background sounds to fit personal preferences and needs in the context of broadcast audio. When individual audio stems are unavailable from production, Dialogue Separation (DS) can be applied to the final audio mixture to obtain estimates of these stems. This work focuses on Preferred Loudness Differences (PLDs) between dialogue and background sounds. While previous studies determined the PLD through a listening test employing original stems from production, the present study uses stems estimated by DS. In addition, a larger variety of signal classes is considered. PLDs vary substantially across individuals (average interquartile range: 5.7 LU). Despite this variability, PLDs are found to depend strongly on the signal type under consideration, and it is shown that median PLDs can be predicted using objective intelligibility metrics. Two existing baseline prediction methods, intended for use with original stems, yield a Mean Absolute Error (MAE) of 7.5 LU and 5 LU, respectively. A modified baseline (MAE: 3.2 LU) and an alternative approach (MAE: 2.5 LU) are proposed. The results support the viability of processing final broadcast mixtures with DS and offering an alternative remix that accounts for median PLDs.
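
To illustrate the quantities involved, the following minimal sketch measures the dialogue-to-background Loudness Difference (LD) in LU from two stems and rescales the background so that a target (e.g., median) PLD is met. It is not the paper's implementation: the pyloudnorm library is used here as one possible ITU-R BS.1770-style loudness meter, the file names are placeholders for DS-estimated stems, and the target value of 10 LU is a hypothetical example.

    # Minimal sketch (assumptions noted above): compute LD in LU and remix
    # the stems so that the loudness difference matches a target PLD.
    import numpy as np
    import soundfile as sf
    import pyloudnorm as pyln

    def loudness_lufs(signal: np.ndarray, rate: int) -> float:
        """Integrated loudness (LUFS), BS.1770-style, via pyloudnorm."""
        return pyln.Meter(rate).integrated_loudness(signal)

    def remix_to_target_pld(dialogue: np.ndarray, background: np.ndarray,
                            rate: int, target_pld_lu: float) -> np.ndarray:
        """Scale the background so that (dialogue loudness - background
        loudness) equals target_pld_lu, then return the remixed signal."""
        ld = loudness_lufs(dialogue, rate) - loudness_lufs(background, rate)
        gain_db = ld - target_pld_lu   # negative gain attenuates the background
        background_scaled = background * (10.0 ** (gain_db / 20.0))
        return dialogue + background_scaled  # note: no clipping protection

    # Example usage with DS-estimated stems (placeholder file names):
    dialogue, rate = sf.read("dialogue_estimate.wav")
    background, _ = sf.read("background_estimate.wav")
    remix = remix_to_target_pld(dialogue, background, rate, target_pld_lu=10.0)
    sf.write("remix.wav", remix, rate)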
