Implicit HRTF Modeling Using Temporal Convolutional Networks

Estimating accurate head-related transfer functions (HRTFs) is crucial for achieving realistic binaural acoustic experiences. HRTFs depend on source and listener locations and are therefore expensive and cumbersome to measure; traditional approaches require listener-dependent HRTF measurements at thousands of distinct spatial directions in an anechoic chamber. In this work, we present a data-driven approach that learns HRTFs implicitly with a neural network, achieving state-of-the-art results compared to traditional approaches while relying on a much simpler data capture that can be performed in arbitrary, non-anechoic rooms. Despite this simpler and less acoustically ideal data capture, our deep-learning-based approach learns HRTFs of high quality. In a perceptual study, the produced binaural audio is ranked on par with traditional DSP approaches by human listeners, and we show that interaural time differences (ITDs), interaural level differences (ILDs), and spectral cues are accurately estimated.
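
The ITD and ILD claims can be checked with standard signal-processing estimators; the abstract does not say how these quantities are computed, so the snippet below is only an illustrative sketch (the function name `itd_ild`, the sampling rate, and the ±1 ms lag window are assumptions, not details from the paper). It estimates the ITD as the cross-correlation lag between the two ear signals and the ILD as their RMS energy ratio in dB.

```python
import numpy as np

def itd_ild(left, right, sr=48000, max_lag_ms=1.0):
    """Estimate ITD (seconds) and ILD (dB) from a binaural signal pair.

    ITD: lag maximizing the cross-correlation between the ears, searched
         within a physiologically plausible window (about +/- 1 ms).
    ILD: ratio of RMS energies between the ears, expressed in dB.
    """
    max_lag = int(sr * max_lag_ms / 1000)

    # Full cross-correlation; for equal-length inputs the zero-lag term
    # sits at index len(right) - 1 of the 'full' output.
    xcorr = np.correlate(left, right, mode="full")
    center = len(right) - 1
    window = xcorr[center - max_lag:center + max_lag + 1]
    # Positive ITD means the left-ear signal lags the right-ear signal.
    itd = (np.argmax(window) - max_lag) / sr

    rms_l = np.sqrt(np.mean(left ** 2) + 1e-12)
    rms_r = np.sqrt(np.mean(right ** 2) + 1e-12)
    # Positive ILD means the signal is louder at the left ear.
    ild = 20 * np.log10(rms_l / rms_r)
    return itd, ild
```

Comparing these estimates between predicted and reference binaural renderings gives a simple objective check on spatialization accuracy, alongside the perceptual study.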
