Extending Linear Dynamical Systems with Dynamic Stream Weights for Audiovisual Speaker Localization

An important aspect of audiovisual speaker localization is the appropriate fusion of acoustic and visual observations based on their time-varying reliability. To cope with this challenge, this study proposes a framework that incorporates dynamic stream weights into the well-known Kalman filter. Dynamic stream weights have recently been investigated in the context of audiovisual automatic speech recognition, where they were successfully applied to weight audiovisual observations according to their reliability. This study extends that approach to linear dynamical systems and additionally introduces a closed-form solution for computing oracle dynamic stream weights from observation sequences with known state trajectories. The proposed approach is evaluated on audiovisual recordings from a humanoid robot in reverberant environments. The results indicate that incorporating dynamic stream weights allows for efficient data fusion on a per-frame basis and outperforms conventional Kalman-filter-based state estimation.
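
To make the idea of per-frame weighting concrete, the following is a minimal sketch of one possible weighted measurement update, not the formulation used in the study. It assumes that the acoustic and visual Gaussian observation likelihoods are raised to the exponents `lam` and `1 - lam`, as is common for stream weights in audiovisual speech recognition, which scales the corresponding precision terms. The function name `weighted_kalman_update` and all parameter names are hypothetical.

```python
import numpy as np


def weighted_kalman_update(x_pred, P_pred, y_a, y_v, H_a, H_v, R_a, R_v, lam):
    """Fuse one acoustic and one visual observation with stream weight lam.

    Assumption (not taken from the paper): the acoustic likelihood is raised
    to the power lam and the visual likelihood to the power 1 - lam, with
    lam in [0, 1]. For Gaussians this scales each stream's precision.
    """
    # Work in information (precision) form so lam = 0 or lam = 1 is well defined.
    P_inv = np.linalg.inv(P_pred)
    Ra_inv = np.linalg.inv(R_a)
    Rv_inv = np.linalg.inv(R_v)

    # Posterior precision: prior precision plus weighted observation precisions.
    info = (P_inv
            + lam * H_a.T @ Ra_inv @ H_a
            + (1.0 - lam) * H_v.T @ Rv_inv @ H_v)
    P_post = np.linalg.inv(info)

    # Posterior mean: precision-weighted combination of prior and observations.
    x_post = P_post @ (P_inv @ x_pred
                       + lam * H_a.T @ Ra_inv @ y_a
                       + (1.0 - lam) * H_v.T @ Rv_inv @ y_v)
    return x_post, P_post


# Toy one-dimensional example (e.g. an azimuth state) with made-up numbers.
x_pred = np.array([0.0])
P_pred = np.array([[1.0]])
H = np.eye(1)
R = np.array([[1.0]])
y_audio = np.array([2.0])
y_video = np.array([0.0])

x_audio_only, _ = weighted_kalman_update(x_pred, P_pred, y_audio, y_video, H, H, R, R, 1.0)
x_video_only, _ = weighted_kalman_update(x_pred, P_pred, y_audio, y_video, H, H, R, R, 0.0)
x_balanced, P_balanced = weighted_kalman_update(x_pred, P_pred, y_audio, y_video, H, H, R, R, 0.5)
```

In this sketch, `lam = 1` reduces to a standard Kalman update on the acoustic observation alone, `lam = 0` to an update on the visual observation alone, and intermediate values interpolate between the two on a per-frame basis.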
