Switching linear dynamic transducer for stereo data based speech feature mapping

The performance of a speech recognition system may degrade even in the absence of background noise, because of linear or non-linear distortions introduced by recording devices or reverberation. One well-known approach to reducing this channel distortion is feature mapping, which maps each distorted speech feature to its clean counterpart. The mapping rule is usually trained on a set of stereo data, which consists of simultaneous recordings obtained in both the reference and target conditions. In this paper, we propose a novel approach to speech feature sequence mapping based on the switching linear dynamic transducer (SLDT). The proposed algorithm enables sequence-to-sequence mapping in a systematic way, in place of the traditional vector-to-vector mapping. The proposed approach is applied to compensate for channel distortion in speech recognition and shows an improvement in recognition performance.
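For orientation, the traditional vector-to-vector mapping that the abstract contrasts with can be sketched as follows. This is a minimal illustration, not the proposed SLDT: it assumes a per-dimension affine mapping fitted by least squares, and the stereo data are synthetic. The names `fit_affine_map` and `apply_map`, the feature dimension, and the simulated gains and offsets are all hypothetical.

```python
import random


def fit_affine_map(distorted, clean):
    """Fit per-dimension gain and bias so that clean ~ gain * distorted + bias.

    Each frame is mapped independently (vector-to-vector), so no temporal
    context is used. This is an assumed baseline, not the SLDT.
    """
    n = len(distorted)
    dim = len(distorted[0])
    gains, biases = [], []
    for d in range(dim):
        xs = [frame[d] for frame in distorted]
        ys = [frame[d] for frame in clean]
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        gain = cov_xy / var_x if var_x > 0 else 1.0
        gains.append(gain)
        biases.append(mean_y - gain * mean_x)
    return gains, biases


def apply_map(frame, gains, biases):
    """Map one distorted feature vector to an estimate of its clean counterpart."""
    return [g * x + b for x, g, b in zip(frame, gains, biases)]


# Synthetic stereo data: the same frames in the clean (reference) condition
# and in a simulated distorted (target) condition. Values are hypothetical.
rng = random.Random(0)
clean = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
channel_gain = [0.8, 1.2, 0.5]
channel_offset = [0.3, -0.1, 0.7]
distorted = [
    [g * c + o for c, g, o in zip(frame, channel_gain, channel_offset)]
    for frame in clean
]

gains, biases = fit_affine_map(distorted, clean)
restored = apply_map(distorted[0], gains, biases)
```

Because each frame is mapped on its own, a sketch like this cannot exploit correlation between neighboring frames, which is the limitation that sequence-to-sequence mapping is meant to address.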
