A natural acoustic front-end for Interactive TV in the EU-Project DICIT

“Distant-talking Interfaces for Control of Interactive TV” (DICIT) is a European Union-funded project whose main objective is to integrate distant-talking voice interaction as a complementary modality to the remote control in interactive TV systems. Hands-free, seamless control enables natural user-system interaction and greatly eases information retrieval. In the given living-room scenario, the system recognizes commands spoken by multiple, possibly moving users, even in the presence of background noise and TV surround audio. This paper focuses on the multichannel acoustic front-end (MCAF) processing for acoustic scene interpretation, which combines multichannel acoustic echo cancellation, blind source separation, beamforming, acoustic event classification, and multiple-speaker localization. The fully functional DICIT prototype consists of the MCAF, automatic speech recognition, natural language understanding, mixed-initiative dialogue, and a satellite connection.
