A Dynamic Stream Weight Backprop Kalman Filter for Audiovisual Speaker Tracking

Audiovisual speaker tracking has been tackled by a wide range of classical approaches based on Gaussian filters, most notably the well-known Kalman filter. Recently, a Kalman filter variant was proposed for this task that incorporates dynamic stream weights to explicitly control the influence of acoustic and visual observations during state estimation. Inspired by recent progress in integrating uncertainty estimates into modern deep learning frameworks, this paper proposes a deep neural-network-based implementation of the Kalman filter with dynamic stream weights, whose parameters can be learned via standard backpropagation. This allows the parameters of the model and of the dynamic stream weight estimator to be optimized jointly in a unified framework. An experimental study on audiovisual speaker tracking shows that the proposed model achieves performance comparable to state-of-the-art recurrent neural networks, with the additional advantages of requiring fewer parameters and providing explicit uncertainty information.
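To make the fusion idea concrete, the following is a minimal sketch of a single Kalman filter step in which a dynamic stream weight `lam` in (0, 1) balances acoustic against visual observations. It assumes one common formulation in which the stream weight scales the observation precisions of the two streams before an information-form update; the function name, variable names, and this particular weighting scheme are illustrative assumptions, not the paper's exact model (where, additionally, all parameters would be learned end-to-end via backpropagation).

```python
import numpy as np

def dsw_kalman_step(mu, P, z_a, z_v, lam, F, Q, H, R_a, R_v):
    """One predict + update step of a Kalman filter with a dynamic
    stream weight lam in (0, 1) balancing audio and visual streams.

    Illustrative sketch: the weight scales each stream's observation
    precision (lam for audio, 1 - lam for visual), so lam -> 1 trusts
    the acoustic observation z_a and lam -> 0 the visual one z_v.
    """
    # Prediction step with linear dynamics F and process noise Q.
    mu_pred = F @ mu
    P_pred = F @ P @ F.T + Q

    # Weighted observation precisions (shared observation model H here,
    # which is a simplifying assumption for the sketch).
    prec_a = H.T @ np.linalg.inv(R_a / lam) @ H
    prec_v = H.T @ np.linalg.inv(R_v / (1.0 - lam)) @ H

    # Information-form fusion of prior and both weighted streams.
    info = np.linalg.inv(P_pred) + prec_a + prec_v
    P_new = np.linalg.inv(info)
    mu_new = P_new @ (np.linalg.inv(P_pred) @ mu_pred
                      + H.T @ np.linalg.inv(R_a / lam) @ z_a
                      + H.T @ np.linalg.inv(R_v / (1.0 - lam)) @ z_v)
    return mu_new, P_new
```

With a weight near 1 the posterior mean is pulled toward the acoustic observation, and with a weight near 0 toward the visual one; in the backprop variant described above, `lam` would be produced per time step by a learned estimator rather than set by hand.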
