Audible Panorama: Automatic Spatial Audio Generation for Panorama Imagery

As 360° cameras and virtual reality headsets become more popular, panorama images have become increasingly ubiquitous. Although sound is essential to delivering immersive and interactive user experiences, most panorama images do not come with native audio. In this paper, we propose an automatic algorithm to augment static panorama images through realistic audio assignment. We accomplish this goal through object detection, scene classification, object depth estimation, and audio source placement. We built an audio file database composed of over 500 audio files to facilitate this process. We designed and conducted a user study to verify the efficacy of various components in our pipeline. We ran our method on a large variety of panorama images of indoor and outdoor scenes. By analyzing the resulting statistics, we learned the relative importance of these components, which can guide prioritization in power-sensitive, time-critical tasks such as mobile augmented reality (AR) applications.
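The final pipeline stage, audio source placement, can be illustrated with a minimal sketch. It assumes the panorama is stored in an equirectangular projection and that the detection and depth stages yield a pixel location and a distance in meters for each object; the abstract states neither detail, and the function name `pixel_to_source_position` and the axis convention (x right, y up, z forward) are hypothetical choices made for illustration, not the authors' implementation.

```python
import math

def pixel_to_source_position(x, y, width, height, depth_m):
    """Convert a pixel in an equirectangular panorama plus an estimated
    depth into a 3D position relative to the camera.

    Assumed convention (not from the paper): the image center looks along
    +z, +x points right, +y points up.
    """
    # Horizontal pixel position maps linearly to azimuth in [-pi, pi].
    azimuth = (x / width - 0.5) * 2.0 * math.pi
    # Vertical pixel position maps linearly to elevation in [-pi/2, pi/2].
    elevation = (0.5 - y / height) * math.pi

    # Spherical to Cartesian conversion, scaled by the depth estimate.
    px = depth_m * math.cos(elevation) * math.sin(azimuth)
    py = depth_m * math.sin(elevation)
    pz = depth_m * math.cos(elevation) * math.cos(azimuth)
    return (px, py, pz)

if __name__ == "__main__":
    # An object detected at the image center, estimated to be 2 m away,
    # lands directly in front of the listener.
    print(pixel_to_source_position(2048, 1024, 4096, 2048, 2.0))
```

A spatial audio engine could then attach the audio clip chosen for the detected object to the returned position, so the sound appears to come from the direction and distance at which the object is seen.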
