Encoding Navigable Speech Sources: A Psychoacoustic-Based Analysis-by-Synthesis Approach

This paper presents a psychoacoustic-based analysis-by-synthesis approach for compressing navigable speech sources. The approach targets multi-party teleconferencing applications, where selective reproduction of individual speech sources is desired. By exploiting the sparsity of speech in the perceptual time-frequency domain, the approach encodes multiple speech signals into a single mono mixture signal, which can be further compressed using a standard speech codec. Side information indicating the active speech source at each time-frequency instant enables flexible decoding and reproduction. Objective results highlight the importance of considering perception when exploiting the sparse nature of speech in the time-frequency domain. They also show that this sparsity, measured by the preserved energy of perceptually important time-frequency components extracted from mixtures of speech signals, is similar in anechoic and reverberant environments. The proposed approach is applied to a series of simulated and real reverberant speech recordings, and the resulting speech mixtures are compressed using a standard speech codec operating at 32 kbps. In both objective and subjective evaluations, the perceptual quality exceeds that of a simple sparsity approach that does not consider perception and that of an approach that encodes each source separately. Subjective tests also confirm that the approach maintains the perceptual quality of the spatialized speech scene as well as that of the individual speech sources.
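The encoding idea summarized above (one mono mixture plus side information naming the active source in each time-frequency instant) can be illustrated with a minimal sketch. This is not the paper's psychoacoustic analysis-by-synthesis method. It assumes the simplest selection rule, keeping the source with the largest magnitude in each bin, which is closer to the "simple sparsity approach" used as a baseline. The function names `encode`, `decode`, and `preserved_energy`, and the toy spectrogram values, are hypothetical and chosen only for illustration. The time-frequency values are assumed to have been computed already.

```python
def encode(sources):
    """Merge several time-frequency representations into one mixture.

    sources: list of spectrograms, each a list of frames, each frame a
    list of bin values (real or complex). Assumption: the active source
    in a bin is simply the one with the largest magnitude.
    Returns (mixture, active), where active holds the source index per bin.
    """
    n_frames = len(sources[0])
    n_bins = len(sources[0][0])
    mixture, active = [], []
    for t in range(n_frames):
        mix_frame, idx_frame = [], []
        for f in range(n_bins):
            # Ties resolve to the lowest source index.
            s = max(range(len(sources)), key=lambda i: abs(sources[i][t][f]))
            mix_frame.append(sources[s][t][f])
            idx_frame.append(s)
        mixture.append(mix_frame)
        active.append(idx_frame)
    return mixture, active


def decode(mixture, active, n_sources):
    """Recover each source by keeping only the bins assigned to it."""
    return [
        [[x if a == s else 0.0 for x, a in zip(mix_frame, idx_frame)]
         for mix_frame, idx_frame in zip(mixture, active)]
        for s in range(n_sources)
    ]


def preserved_energy(original, decoded):
    """Fraction of a source's energy that survives encoding and decoding."""
    kept = sum(abs(x) ** 2 for frame in decoded for x in frame)
    total = sum(abs(x) ** 2 for frame in original for x in frame)
    return kept / total if total else 1.0


# Toy example: two talkers, two frames, three frequency bins (made-up values).
talker_a = [[4.0, 0.5, 0.0], [3.0, 0.0, 1.0]]
talker_b = [[1.0, 2.0, 0.0], [0.0, 2.0, 2.0]]

mixture, active = encode([talker_a, talker_b])
decoded = decode(mixture, active, 2)
ratio_a = preserved_energy(talker_a, decoded[0])
ratio_b = preserved_energy(talker_b, decoded[1])
```

In this sketch the side information costs one source index per bin, and the preserved-energy ratio gives a rough, non-perceptual indication of how much of each talker survives the merge. A perceptually motivated variant would replace the magnitude comparison with a selection rule based on a psychoacoustic model, which is not shown here.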
