Acoustic modeling and speaker normalization strategies with application to robust in-vehicle speech recognition and dialect classification
The speech signal carries multiple blended levels of information: a linguistic level, including the spoken message (content), language, and accent/dialect; a speaker-specific level, including gender, emotion, stress, age, physical size (of the speaker's vocal tract), and speaker identity; and environmental characteristics, such as communication-channel frequency response, microphone/recording media, and background noise. This dissertation focuses on improved automatic speech recognition under noisy and dialectal speech conditions. Specifically, improved acoustic modeling is considered for in-vehicle environments. Reducing inter-speaker variability within the feature set to increase recognition performance is also considered. Finally, the proposed algorithms are applied to the dialect classification problem.
The first phase develops new front-ends for speech recognition in noisy car environments. We propose two new acoustic front-ends based on the minimum variance distortionless response (MVDR) method. The primary contribution is the formulation of a novel perceptual MVDR-based feature, the PMVDR front-end. We show that the PMVDR front-end outperforms previously proposed MVDR-based front-ends on standardized speech recognition tasks.
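As background, the classical (unwarped) MVDR spectral envelope can be obtained from the linear prediction (LPC) coefficients via Musicus's correlation-domain formula; the PMVDR front-end additionally applies a perceptual (bark/mel-style) frequency warp before envelope estimation, which is omitted here. A minimal NumPy sketch of the plain MVDR spectrum (function names and parameter choices are illustrative, not the dissertation's implementation):

```python
import numpy as np

def lpc(x, order):
    """LPC coefficients and prediction error via Levinson-Durbin
    on the (unnormalized) autocorrelation of the frame."""
    r = np.array([x[: len(x) - k] @ x[k:] for k in range(order + 1)])
    a = np.array([1.0])
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:] @ r[i - 1:0:-1]   # reflection numerator
        k = -acc / e
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]                  # order-update of the polynomial
        e *= 1.0 - k * k                     # updated prediction error
    return a, e

def mvdr_spectrum(a, e, nfft=512):
    """MVDR envelope from LPC coefficients (Musicus's fast formula):
    S(w) = 1 / (mu_0 + 2 * sum_k mu_k cos(k w)) for a real signal."""
    p = len(a) - 1
    mu = np.array([
        sum((p + 1 - k - 2 * i) * a[i] * a[i + k] for i in range(p - k + 1))
        for k in range(p + 1)
    ]) / e
    w = np.linspace(0, np.pi, nfft)
    k = np.arange(1, p + 1)
    denom = mu[0] + 2.0 * (mu[1:] @ np.cos(np.outer(k, w)))
    return 1.0 / denom
```

On a strong sinusoid in noise, the resulting envelope is positive everywhere and peaks near the sinusoid frequency, illustrating the smooth, distortionless character of MVDR envelopes that the PMVDR front-end builds on.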
The second phase proposes a Built-In Speaker Normalization (BISN) algorithm, which is similar to traditional Vocal Tract Length Normalization (VTLN). However, several improvements to the search stage are integrated to reduce the computational cost. Finally, an on-the-fly version is introduced and evaluated within the PMVDR framework. This implementation makes it possible to employ speaker normalization seamlessly within the front-end and to re-apply/refine the normalization as more speaker data becomes available.
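For context, conventional VTLN warps the frequency axis per speaker and selects the warp factor by an exhaustive likelihood search, which is the cost that BISN's improved search stage targets. A hedged sketch of the conventional baseline, using a common piecewise-linear warp and a grid search (the cutoff fraction, warp grid, and score function are illustrative assumptions, not the dissertation's settings):

```python
import numpy as np

def warp_frequencies(freqs, alpha, f_cut=0.85):
    """Piecewise-linear VTLN-style warp: scale frequencies by alpha below
    a cutoff, then connect linearly so the Nyquist frequency maps to itself.
    Assumes freqs[-1] is the Nyquist frequency."""
    nyq = freqs[-1]
    cut = f_cut * nyq
    return np.where(
        freqs <= cut,
        alpha * freqs,
        alpha * cut + (nyq - alpha * cut) * (freqs - cut) / (nyq - cut),
    )

def select_warp(score_fn, alphas=np.arange(0.88, 1.13, 0.02)):
    """Exhaustive search: score each candidate warp factor (e.g. by acoustic
    model likelihood of the warped features) and keep the best one."""
    scores = [score_fn(a) for a in alphas]
    return alphas[int(np.argmax(scores))]
```

The exhaustive loop requires one feature-extraction-plus-scoring pass per candidate warp factor; reducing that search cost, and folding the warp into the front-end itself, is precisely where BISN departs from this baseline.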
In the final section, the proposed PMVDR acoustic front-end and BISN speaker normalization algorithm are applied to dialect classification. Since dialect differences in speech are observable at the phoneme level, the proposed classification algorithms are able to take advantage of a better acoustic front-end. Moreover, as speaker variability is reduced, dialect-dependent traits of the input speech are expected to become more dominant, thereby improving classification performance.
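To illustrate the kind of back-end such a front-end would feed, dialect identification is commonly framed as scoring utterance frames against per-dialect acoustic models (typically Gaussian mixture models). A deliberately simplified sketch with a single diagonal-covariance Gaussian per dialect, in place of a full GMM (all names are illustrative, not from the dissertation):

```python
import numpy as np

class DiagGaussianDialectModel:
    """One diagonal-covariance Gaussian per dialect: a minimal stand-in
    for the per-dialect mixture models used in dialect classification."""

    def fit(self, feats_by_dialect):
        # feats_by_dialect: dict mapping dialect label -> (frames x dims) array
        self.params = {
            d: (X.mean(axis=0), X.var(axis=0) + 1e-6)  # variance floor
            for d, X in feats_by_dialect.items()
        }
        return self

    @staticmethod
    def _frame_loglik(X, mean, var):
        # Per-frame log-likelihood under a diagonal Gaussian.
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var, axis=1)

    def classify(self, X):
        """Score an utterance (frames x dims) against every dialect model
        and return the label with the highest total log-likelihood."""
        scores = {d: self._frame_loglik(X, m, v).sum()
                  for d, (m, v) in self.params.items()}
        return max(scores, key=scores.get)
```

Under this framing, a sharper acoustic front-end and reduced speaker variability both act on the feature arrays `X`, which is why the PMVDR and BISN improvements carry over to the classification task.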
The formulated algorithms are shown to be valuable for speech parameterization and speaker normalization in real-world tasks. Moreover, their successful application to a different speech classification problem, dialect classification, confirms their importance and potential long-term impact in the field of speech processing and language technology, beyond the problem of robust automatic speech recognition.