Acoustic Room Modelling Using 360 Stereo Cameras

In this paper we propose a pipeline for estimating the acoustic 3D structure of a room, with geometry and attribute prediction, using spherical 360$^{\circ}$ cameras. Instead of setting up microphone arrays and loudspeakers to measure the acoustic parameters of a specific room, a simple and practical single-shot capture of the scene with a stereo pair of 360$^{\circ}$ cameras can be used to simulate those parameters. We assume that the room and the objects in it can be represented as cuboids aligned to the main axes of the room coordinate system (Manhattan world). The scene is captured as a stereo pair using off-the-shelf consumer spherical 360$^{\circ}$ cameras. A cuboid-based 3D room geometry model is estimated by correspondence matching between the captured images and by semantic labelling using a convolutional neural network (SegNet). The estimated geometry is used to produce frequency-dependent acoustic predictions of the scene. This is, to our knowledge, the first attempt in the literature to use visual geometry estimation and object classification algorithms to predict acoustic properties. Results are compared to measurements through calculated reverberant spatial audio object parameters, which are used for reverberation reproduction customized to the given loudspeaker setup.
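As a rough illustration of how an estimated cuboid geometry with labelled surfaces could feed a frequency-dependent acoustic prediction, the sketch below applies Sabine's reverberation formula per octave band. This is not claimed to be the method used in the paper, which the abstract does not specify. The `Cuboid` class, the `sabine_rt60` function, the band list, and every absorption coefficient are hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass

# Octave-band centre frequencies in Hz (an assumed, commonly used set).
BANDS_HZ = (125, 250, 500, 1000, 2000, 4000)


@dataclass
class Cuboid:
    """Axis-aligned (Manhattan-world) room; dimensions in metres."""
    width: float
    depth: float
    height: float

    @property
    def volume(self) -> float:
        return self.width * self.depth * self.height

    def surface_areas(self) -> dict:
        """Area in square metres of each labelled surface group."""
        return {
            "floor": self.width * self.depth,
            "ceiling": self.width * self.depth,
            "walls": 2.0 * self.height * (self.width + self.depth),
        }


def sabine_rt60(room: Cuboid, absorption: dict) -> list:
    """Sabine estimate RT60 = 0.161 * V / A for each band.

    `absorption` maps a surface name to one coefficient per entry in
    BANDS_HZ; A is the area-weighted sum of those coefficients.
    """
    areas = room.surface_areas()
    rt60 = []
    for band in range(len(BANDS_HZ)):
        total_absorption = sum(
            areas[name] * absorption[name][band] for name in areas
        )
        rt60.append(0.161 * room.volume / total_absorption)
    return rt60


if __name__ == "__main__":
    room = Cuboid(width=5.0, depth=4.0, height=3.0)
    # Invented placeholder coefficients, not measured material data.
    absorption = {
        "floor":   [0.05, 0.10, 0.20, 0.30, 0.40, 0.45],
        "ceiling": [0.10, 0.10, 0.08, 0.06, 0.05, 0.05],
        "walls":   [0.03, 0.03, 0.04, 0.05, 0.05, 0.06],
    }
    for freq, t in zip(BANDS_HZ, sabine_rt60(room, absorption)):
        print(f"{freq:>5} Hz: RT60 ~ {t:.2f} s")
```

In a pipeline of the kind described, the room dimensions would come from the estimated cuboid model and the per-surface coefficients would be looked up from the semantic labels. Sabine's formula assumes a diffuse sound field, so it serves here only as the simplest possible stand-in for a geometry-driven acoustic model.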
