Source Separation for Enabling Dialogue Enhancement in Object-based Broadcast with MPEG-H

Dialogue Enhancement (DE) is one of the most promising applications of user interactivity enabled by object-based audio broadcasting. DE allows personalization of the relative level of dialogue for intelligibility or aesthetic reasons. This paper discusses the implementation of DE in object-based audio transport with MPEG-H, with a special focus on source separation methods that enable DE also for legacy content for which the original objects are not available. The user benefit of DE is assessed using the Adjustment/Satisfaction Test methodology. The test results demonstrate the need for an individually adjustable dialogue level because of highly varying personal preferences. The test also investigates the subjective quality penalty incurred by using source separation to obtain the objects. The results show that even an imperfect separation result can successfully enable DE, leading to increased end-user satisfaction.
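The core interaction described above, adjusting the relative level of a dialogue object against the remaining background before remixing, can be sketched as follows. This is a minimal illustration on raw sample buffers; all names are hypothetical, and the actual MPEG-H system applies such user gains to coded audio objects in the decoder rather than to decoded PCM.

```python
import numpy as np

def apply_dialogue_enhancement(dialogue, background, gain_db):
    """Remix a separated dialogue object with the background,
    boosting (or attenuating) the dialogue by a user-chosen gain in dB.
    Hypothetical helper for illustration only."""
    gain = 10.0 ** (gain_db / 20.0)  # dB to linear amplitude
    return gain * dialogue + background

# Example: boost dialogue by 6 dB relative to the background.
fs = 48000
t = np.arange(fs) / fs
dialogue = 0.1 * np.sin(2 * np.pi * 440.0 * t)    # stand-in for the speech object
background = 0.1 * np.sin(2 * np.pi * 220.0 * t)  # stand-in for music & effects
enhanced = apply_dialogue_enhancement(dialogue, background, 6.0)
```

With separation-based DE, `dialogue` and `background` would come from a source separation front end rather than from original production objects, so separation artifacts leak into both terms of the remix; the test described in the abstract quantifies how much this degrades user satisfaction.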
