Joint-state posterior estimation in factorial speech processing models using deep neural networks

This paper proposes a new method for computing joint-state posteriors of mixed-audio features using deep neural networks, for use in factorial speech processing models. Factorial models require this joint-state posterior information to perform joint decoding. The novelty of this work is an architecture that enables the network to infer joint-state posteriors from pairs of state posteriors of stereo features. The paper defines an objective function for solving an underdetermined system of equations, which the network uses to extract joint-state posteriors, and develops the expressions required for fine-tuning the network in a unified way. Experiments compare the proposed network's decoding results with those of the vector Taylor series method and show a 2.3% absolute performance improvement on the monaural speech separation and recognition challenge. This gain is substantial given the simplicity of joint-state posterior extraction with deep neural networks.
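To make the notion of a joint-state posterior concrete, here is a minimal sketch (not the paper's method) of the naive independence baseline that a learned joint estimator improves upon: given the marginal state posteriors of two speakers, the outer product yields a joint posterior that ignores the dependence between the speakers' states given the mixed observation. The function name and the example posteriors below are hypothetical, chosen only for illustration.

```python
import numpy as np

def naive_joint_posterior(p_a, p_b):
    """Independence-assumption joint of two marginal state posteriors.

    A factorial model's joint decoder needs p(s_a, s_b | x); taking the
    outer product of the marginals p(s_a | x) and p(s_b | x) is the
    simplest (and weakest) estimate, since the two speakers' states are
    generally not independent given the mixed observation x.
    """
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    joint = np.outer(p_a, p_b)      # shape: (|S_a|, |S_b|)
    return joint / joint.sum()      # renormalize to a valid distribution

# Hypothetical marginal posteriors over 3 and 2 HMM states respectively.
p_a = np.array([0.7, 0.2, 0.1])
p_b = np.array([0.5, 0.5])
joint = naive_joint_posterior(p_a, p_b)
print(joint.shape)   # (3, 2)
print(joint.sum())   # 1.0
```

A DNN-based estimator, as proposed in the paper, instead predicts the full joint table directly from the mixed features, capturing the interaction that this outer-product baseline discards.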
