Encoding and communicating navigable speech soundfields

This paper describes a system for encoding and communicating navigable speech soundfields for applications such as immersive audio/visual conferencing, audio surveillance of large spaces, and free-viewpoint television. The system records speech soundfields using compact coincident microphone arrays, which are then processed to identify sources and their spatial locations using the well-known assumption that speech signals are sparse in the time-frequency domain. A low-delay Direction of Arrival (DOA)-based frequency-domain sound source separation approach is proposed that requires only 250 ms of speech signal. Joint compression is achieved through a previously proposed perceptual analysis-by-synthesis spatial audio coding scheme that encodes sources into a mixture signal compressible by a standard speech codec at 32 kbps. By also transmitting side information representing the original spatial location of each source, the received mixtures can be decoded and then flexibly reproduced over loudspeakers at a chosen listening point within a synthesised speech scene. The system was implemented for an example application encoding a three-talker navigable speech scene at a total bit rate of 48 kbps. Subjective listening tests evaluated the quality of the reproduced speech scenes at a new listening point against a true recording at that point. Results demonstrate that the approach successfully encodes multiple spatial speech scenes at low bit rates whilst maintaining perceptual quality in both anechoic and reverberant environments.
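The paper's exact separation algorithm is not given in the abstract; as a rough illustration of the underlying idea (per-bin DOA estimation plus binary time-frequency masking under the speech-sparsity assumption), the sketch below separates two amplitude-encoded sources from a simulated coincident two-channel pair. The channel-gain model `(cos θ, sin θ)`, the angle tolerance, and all signal parameters are illustrative assumptions, not the paper's method.

```python
import numpy as np

def stft(x, n=256, hop=128):
    # Hann-windowed short-time Fourier transform (frames x bins).
    w = np.hanning(n)
    frames = [w * x[i:i + n] for i in range(0, len(x) - n, hop)]
    return np.fft.rfft(frames, axis=1)

def istft(X, n=256, hop=128, length=None):
    # Overlap-add inverse STFT with window-squared normalisation.
    w = np.hanning(n)
    x = np.zeros((X.shape[0] - 1) * hop + n)
    norm = np.zeros_like(x)
    for i, frame in enumerate(np.fft.irfft(X, n, axis=1)):
        x[i * hop:i * hop + n] += w * frame
        norm[i * hop:i * hop + n] += w ** 2
    x /= np.maximum(norm, 1e-8)
    if length is not None:
        y = np.zeros(length)
        y[:min(len(x), length)] = x[:length]
        return y
    return x

fs = 16000
t = np.arange(fs) / fs
s1 = np.sin(2 * np.pi * 440 * t)   # two sources that are (nearly)
s2 = np.sin(2 * np.pi * 1300 * t)  # disjoint in the TF domain
angles = np.deg2rad([20.0, 70.0])  # assumed source DOAs

# Toy coincident-pair model: channel gains (cos θ, sin θ) per source.
xL = np.cos(angles[0]) * s1 + np.cos(angles[1]) * s2
xR = np.sin(angles[0]) * s1 + np.sin(angles[1]) * s2

XL, XR = stft(xL), stft(xR)
theta = np.arctan2(np.abs(XR), np.abs(XL))  # per-bin DOA estimate

separated = []
for a in angles:
    # Binary mask: keep bins whose estimated DOA is near this source.
    mask = (np.abs(theta - a) < np.deg2rad(15)).astype(float)
    # Project the masked mixture back to a mono source estimate.
    separated.append(istft(mask * (np.cos(a) * XL + np.sin(a) * XR),
                           length=len(t)))
```

Because the two sources occupy disjoint frequency bins, each masked reconstruction closely matches its original source; with real overlapping speech, the sparsity assumption makes this hold only approximately.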
