Recognition using speech synthesis: a reactive dynamic for robust ASR
Automatic Speech Recognition (ASR) systems perform poorly on noisy speech. In the Multi-Stream (MS) approach, commonly used to improve ASR robustness, each stream feeds one recognizer, and the resulting estimates are combined through a fusion process. Because some streams convey certain phonemes better than others [1,3], it is useful to give more weight to the best stream during feature extraction and/or fusion [1,2]. In contrast to this feed-forward weighting strategy, we propose one based on a feedback loop from the recognition output back to the signal. The key idea is to use the current recognition to construct an Acoustic Image (α), which is compared with the input signal in order to calculate the Estimates Accuracy (ρ). Specifically, for each frame t, ρ(t) is the correlation between the Power Spectrum Density of the input signal, PSD(X(t)), and PSD(α(t)). The latter is the sum over phonemes k of E(PSD(k)), the average PSD of phoneme k (computed over the 300,000 labelled frames of the training set), weighted by the phoneme posteriors P(q_k|X(t)). That is, PSD(α(t)) = Σ_k [ P(q_k|X(t)) · E(PSD(k)) ] and ρ(t) = Corr[ PSD(X(t)), PSD(α(t)) ].
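The two formulas above can be illustrated with a minimal NumPy sketch. It rests on assumptions the abstract does not state: Corr is taken to be the Pearson correlation across frequency bins within a frame, frames with zero variance are given ρ = 0, and the function names, array shapes, and toy values are hypothetical.

```python
import numpy as np

def acoustic_image_psd(posteriors, mean_psd):
    """PSD(alpha(t)) = sum_k P(q_k|X(t)) * E(PSD(k)).

    posteriors: (T, K) phoneme posteriors per frame.
    mean_psd:   (K, F) average PSD of each phoneme over F frequency bins.
    Returns an array of shape (T, F).
    """
    return np.asarray(posteriors, dtype=float) @ np.asarray(mean_psd, dtype=float)

def estimates_accuracy(input_psd, posteriors, mean_psd):
    """rho(t) = Corr[PSD(X(t)), PSD(alpha(t))], one value per frame.

    Assumption: Corr is the Pearson correlation across frequency bins.
    """
    image = acoustic_image_psd(posteriors, mean_psd)
    x = np.asarray(input_psd, dtype=float)
    # Centre each frame over the frequency axis.
    x = x - x.mean(axis=1, keepdims=True)
    a = image - image.mean(axis=1, keepdims=True)
    num = (x * a).sum(axis=1)
    den = np.sqrt((x ** 2).sum(axis=1) * (a ** 2).sum(axis=1))
    # Assumption: a frame with zero variance gets rho = 0 rather than NaN.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# Toy data (hypothetical): 2 phonemes, 3 frequency bins, 2 frames.
mean_psd = np.array([[1.0, 2.0, 3.0],    # phoneme 0: rising spectrum
                     [3.0, 2.0, 1.0]])   # phoneme 1: falling spectrum
posteriors = np.array([[1.0, 0.0],       # frame 0 recognised as phoneme 0
                       [0.0, 1.0]])      # frame 1 recognised as phoneme 1
input_psd = np.array([[2.0, 4.0, 6.0],   # both input frames have a rising spectrum
                      [2.0, 4.0, 6.0]])

rho = estimates_accuracy(input_psd, posteriors, mean_psd)
```

In this toy case the first frame's acoustic image has the same spectral shape as the input, so ρ is 1. The second frame's image has the opposite slope, so ρ is -1, which flags a recognition estimate that disagrees with the signal.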
[1] Hervé Glotin, et al. Weighting schemes for audio-visual fusion in speech recognition, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001.
[2] Hervé Glotin. Enhanced posteriors bias prediction for robust multi-stream ASR combining voicing and estimate reliabilities, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002.