An Audio-Visual System for Object-Based Audio: From Recording to Listening

Object-based audio is an emerging representation for audio content, where content is represented in a reproduction-format-agnostic way and, thus, produced once for consumption on many different kinds of devices. This affords new opportunities for immersive, personalized, and interactive listening experiences. This paper introduces an end-to-end object-based spatial audio pipeline, from sound recording to listening. A high-level system architecture is proposed, which includes novel audio-visual interfaces to support object-based capture and listener-tracked rendering, and incorporates a proposed component for objectification, that is, recording content directly into an object-based form. Text-based and extensible metadata enable communication between the system components. An open architecture for object rendering is also proposed. The system's capabilities are evaluated in two parts. First, listener-tracked reproduction of metadata automatically estimated from two moving talkers is evaluated using an objective binaural localization model. Second, object-based scene capture with audio extracted using blind source separation (to remix between two talkers) and beamforming (to remix a recording of a jazz group) is evaluated with perceptually motivated objective and subjective experiments. These experiments demonstrate that the novel components of the system add capabilities beyond the state of the art. Finally, we discuss challenges and future perspectives for object-based audio workflows.
