Immersive Spatial Audio Reproduction for VR/AR Using Room Acoustic Modelling from 360° Images

Recent progress in Virtual Reality (VR) and Augmented Reality (AR) has made a wide range of VR/AR applications part of daily life. To maximise user immersion in VR/AR environments, plausible spatial audio reproduction synchronised with the visual information is essential. In this paper, we propose a simple and efficient system that estimates room acoustics from 360° cameras for plausible spatial audio reproduction in VR/AR applications. A pair of 360° images is used for room geometry and acoustic property estimation. A simplified 3D geometric model of the scene is obtained by depth estimation from the captured images and semantic labelling using a convolutional neural network (CNN). The acoustics of the real environment are characterised by frequency-dependent acoustic properties predicted for the scene. Spatially synchronised audio is reproduced based on the estimated geometric and acoustic properties of the scene. The reconstructed scenes are rendered with synthesised spatial audio as VR/AR content. The estimated room geometry and simulated spatial audio are evaluated against actual measurements and against audio rendered from ground-truth Room Impulse Responses (RIRs) recorded in the rooms.
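
To illustrate the step from semantically labelled surfaces to frequency-dependent room acoustics, the minimal Python sketch below maps material labels to octave-band absorption coefficients and derives reverberation time with Sabine's equation. This is only an illustrative stand-in for the full geometric acoustic simulation used in the paper; the material names, surface areas and coefficient values are placeholders, not the paper's data.

    # Minimal sketch (assumed, not the paper's implementation): map semantic
    # material labels to octave-band absorption coefficients and estimate
    # reverberation time with Sabine's formula RT60 = 0.161 * V / sum(S_i * a_i).

    OCTAVE_BANDS_HZ = [125, 250, 500, 1000, 2000, 4000]

    # Placeholder absorption coefficients per octave band for a few materials.
    ABSORPTION = {
        "painted_plaster": [0.01, 0.01, 0.02, 0.02, 0.02, 0.03],
        "carpet":          [0.08, 0.24, 0.57, 0.69, 0.71, 0.73],
        "glass_window":    [0.35, 0.25, 0.18, 0.12, 0.07, 0.04],
        "wood_floor":      [0.15, 0.11, 0.10, 0.07, 0.06, 0.07],
    }

    def sabine_rt60(volume_m3, surfaces):
        """Estimate RT60 per octave band.

        surfaces: list of (material_label, area_m2) pairs, e.g. the planar
        patches of a simplified room model with their semantic labels.
        """
        rt60 = []
        for band in range(len(OCTAVE_BANDS_HZ)):
            total_absorption = sum(area * ABSORPTION[label][band]
                                   for label, area in surfaces)
            rt60.append(0.161 * volume_m3 / max(total_absorption, 1e-9))
        return rt60

    if __name__ == "__main__":
        # Example: a 6 m x 4 m x 2.5 m room with labelled surfaces.
        surfaces = [
            ("painted_plaster", 2 * (6 * 2.5) + 2 * (4 * 2.5)),  # walls
            ("painted_plaster", 6 * 4),                          # ceiling
            ("carpet", 6 * 4),                                   # floor
        ]
        for f, t in zip(OCTAVE_BANDS_HZ, sabine_rt60(6 * 4 * 2.5, surfaces)):
            print(f"{f:>5} Hz: RT60 ~ {t:.2f} s")

In the proposed system, per-band properties of this kind would feed a geometric room acoustic model (e.g. image-source or ray-based simulation) to synthesise the RIRs used for spatial audio reproduction, rather than a single statistical reverberation estimate.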
