DPLM: A Deep Perceptual Spatial-Audio Localization Metric

Subjective evaluations are critical for assessing the perceptual realism of sounds in audio-synthesis-driven technologies such as augmented and virtual reality, but they are difficult to set up, fatiguing for participants, and expensive. In this work, we address the problem of capturing the perceptual characteristics of sound localization. Specifically, we propose a framework for building a general-purpose quality metric that assesses spatial-localization differences between two binaural recordings. We model localization similarity using activation-level distances from deep networks trained for direction-of-arrival (DOA) estimation. The proposed metric (DPLM) outperforms baseline metrics in correlation with subjective ratings across a diverse set of datasets, even without any human-labeled training data.
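
Once a DOA-estimation network is available, the metric reduces to a simple computation: pass the reference and test binaural recordings through the network and aggregate distances between intermediate activations. The sketch below illustrates that idea; the DOANet architecture, layer selection, normalization, and equal layer weighting are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of an activation-distance metric in the spirit of DPLM.
# Assumptions (not from the paper): the toy DOA network, layer choice,
# normalization, and weighting are placeholders; the real metric uses a
# network trained for direction-of-arrival (DOA) estimation.
import torch
import torch.nn as nn


class DOANet(nn.Module):
    """Toy stand-in for a CNN trained on DOA estimation (hypothetical)."""

    def __init__(self, n_doa_classes: int = 36):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU()),
            nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU()),
        ])
        self.head = nn.Linear(64, n_doa_classes)  # DOA classifier head (unused by the metric)

    def features(self, spec: torch.Tensor) -> list:
        """Return intermediate activations for a binaural spectrogram
        of shape (batch, 2, freq, time)."""
        acts = []
        x = spec
        for block in self.blocks:
            x = block(x)
            acts.append(x)
        return acts


def dplm_style_distance(net: DOANet, ref: torch.Tensor, test: torch.Tensor) -> torch.Tensor:
    """Average per-layer L1 distance between channel-normalized activations
    of the reference and test binaural spectrograms."""
    net.eval()
    with torch.no_grad():
        dists = []
        for a, b in zip(net.features(ref), net.features(test)):
            a = a / (a.norm(dim=1, keepdim=True) + 1e-8)  # unit-normalize channels
            b = b / (b.norm(dim=1, keepdim=True) + 1e-8)
            dists.append((a - b).abs().mean())
        return torch.stack(dists).mean()


if __name__ == "__main__":
    net = DOANet()  # in practice: load weights trained for DOA estimation
    ref = torch.randn(1, 2, 128, 100)   # reference binaural spectrogram
    test = torch.randn(1, 2, 128, 100)  # degraded / re-rendered version
    print(float(dplm_style_distance(net, ref, test)))
```

A lower distance indicates that the two recordings produce more similar spatial cues as seen by the DOA network; in the paper this distance is then correlated with subjective localization ratings.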
