Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model

In typical multi-talker speech recognition systems, a neural-network-based acoustic model predicts senone state posteriors for each speaker, and a single-talker decoder is then applied to each speaker-specific output stream separately. In this work, we argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly. We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers. We employ a joint decoder that can exploit this uncertainty together with higher-level language information. For this, we revisit decoding algorithms used with factorial generative models in early multi-talker speech recognition systems. In contrast with these early works, we replace the GMM acoustic model with a DNN, which provides greater modeling power and simplifies part of the inference. We demonstrate the advantage of joint decoding in proof-of-concept experiments on a mixed-TIDIGITS dataset.
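To make the core idea concrete, below is a minimal PyTorch sketch, not the authors' implementation, of an acoustic model that emits a single joint posterior over pairs of senone states rather than two independent per-speaker posteriors. The class name `JointPosteriorAM`, the layer sizes, and the two-speaker restriction are illustrative assumptions; the point is only that the output lives on the product state space, so the network can spread probability mass over both attributions of a frame to the speakers.

```python
# Minimal sketch (illustrative, not the paper's code): an acoustic model
# predicting a joint posterior p(s1, s2 | x_t) over pairs of senone states
# for two speakers, instead of two separate per-speaker posteriors.
import torch
import torch.nn as nn

class JointPosteriorAM(nn.Module):
    def __init__(self, feat_dim: int, num_senones: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One logit per (senone of speaker 1, senone of speaker 2) pair.
        self.joint_head = nn.Linear(hidden, num_senones * num_senones)
        self.num_senones = num_senones

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim)
        h = self.encoder(feats)
        logits = self.joint_head(h)                  # (batch, time, S*S)
        log_joint = torch.log_softmax(logits, dim=-1)
        # Reshape to an explicit (S, S) joint posterior per frame.
        return log_joint.view(*feats.shape[:2],
                              self.num_senones, self.num_senones)

# Example: 40-dim features, 10 senones -> an (S, S) joint posterior per frame.
am = JointPosteriorAM(feat_dim=40, num_senones=10)
log_post = am(torch.randn(2, 100, 40))
print(log_post.shape)  # torch.Size([2, 100, 10, 10])
</pre>
```

A joint decoder would then search over this product state space, analogously to factorial-HMM decoding, with the DNN's joint posteriors standing in for the GMM likelihoods of the early systems mentioned in the abstract.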
