Enhancement, segmentation, and synthesis of speech with application to robust speaker recognition
This thesis addresses four distinct problems related to robust speaker recognition. The main contributions include new methods for additive colored noise suppression, improved robustness of speech segmentation to noise and channel corruption, computationally efficient speaker recognition, as well as trainable speaker-dependent speech synthesis.
The first contribution extends a class of constrained iterative speech enhancement algorithms to better account for the frequency-dependent signal-to-noise ratio (SNR) of colored noise. Spectral constraints, applied between iterations, are adapted across both time and frequency by considering a frequency subband decomposition of the corrupted speech. The proposed algorithm adjusts the terminating iteration count within each signal subband in order to reduce spectral smoothing in high SNR regions, while allowing improved noise attenuation in low SNR regions.
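As a rough illustration of this subband-adaptive stopping rule, the sketch below estimates per-subband SNR and maps it to a terminating iteration count, using fewer iterations at high SNR (less spectral smoothing) and more at low SNR (stronger noise attenuation). The band edges, SNR break points, and iteration bounds are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def subband_snr_db(noisy_psd, noise_psd, band_edges):
    """Estimate per-subband SNR (dB) from power spectra of one analysis frame."""
    snrs = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        noise = max(noise_psd[lo:hi].sum(), 1e-12)
        sig = max(noisy_psd[lo:hi].sum() - noise, 1e-12)  # spectral subtraction estimate
        snrs.append(10.0 * np.log10(sig / noise))
    return np.array(snrs)

def terminating_iterations(snr_db, min_iters=2, max_iters=8):
    """Map subband SNR to an iteration count via an assumed linear ramp
    between -5 dB (max iterations) and +15 dB (min iterations)."""
    frac = np.clip((15.0 - snr_db) / 20.0, 0.0, 1.0)
    return np.round(min_iters + frac * (max_iters - min_iters)).astype(int)
```

For example, `terminating_iterations(subband_snr_db(noisy_psd, noise_psd, [0, 64, 128, 256]))` would yield one stopping iteration per band, so enhancement can halt early in clean bands while continuing in noisy ones.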
The second contribution investigates methods for improving the robustness of hidden Markov model (HMM) based speech segmentation. Previous work has focused on segmenting speech recorded in ideal noise-free conditions; in the presence of noise or channel distortion, however, segmentation accuracy can degrade substantially. Compensation methods, including speech enhancement, model and parameter adaptation, and phone duration modeling, are therefore compared for mitigating the impact of additive and convolutional noise. A new segmentation confidence measure is also proposed for adverse environments and subsequently shown to detect gross time-alignment errors.
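The thesis's confidence measure is not reproduced in this abstract; the sketch below shows one common form such a measure can take, a duration-normalized log-likelihood ratio between the hypothesized segment model and an anti-model, with an assumed flagging threshold.

```python
import numpy as np

def segment_confidence(frame_loglik_model, frame_loglik_anti):
    """Duration-normalized log-likelihood ratio for one hypothesized segment."""
    n = len(frame_loglik_model)
    return (np.sum(frame_loglik_model) - np.sum(frame_loglik_anti)) / max(n, 1)

def flag_alignment_errors(segments, threshold=-2.0):
    """Return indices of segments whose confidence falls below the (assumed)
    threshold; segments is a list of (model_logliks, anti_logliks) pairs."""
    return [i for i, (m, a) in enumerate(segments)
            if segment_confidence(m, a) < threshold]
```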
The third contribution improves the efficiency of Gaussian Mixture Model (GMM) based speaker identification. Specifically, observation sequence reordering is proposed to improve the statistical independence assumption of GMMs. Observation reordering allows the feature space of the voice under test to be rapidly sampled, thereby reducing the computation by a factor of 6 compared to sequential observation sampling with beam-search while resulting in no loss in identification accuracy.
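A minimal sketch of the reordering-plus-pruning idea follows, assuming sklearn-style GMMs exposing a `score_samples` method. The coarse-to-fine stride permutation and the pruning schedule are illustrative assumptions, not the thesis's exact beam-search configuration.

```python
import numpy as np

def reorder_observations(features):
    """Permute frames (numpy array, frames x dims) coarse-to-fine so that any
    prefix of the reordered sequence samples the whole utterance."""
    n = len(features)
    order, seen, stride = [], set(), n
    while stride >= 1:
        for i in range(0, n, stride):
            if i not in seen:
                seen.add(i)
                order.append(i)
        stride //= 2
    return features[order]

def identify(reordered, gmms, block=50, keep_frac=0.5):
    """Score candidate GMMs on growing prefixes, pruning the weakest half
    after each block; returns the index of the winning model."""
    active = list(range(len(gmms)))
    totals = np.zeros(len(gmms))
    for start in range(0, len(reordered), block):
        chunk = reordered[start:start + block]
        for k in active:
            totals[k] += gmms[k].score_samples(chunk).sum()
        if len(active) > 1:
            cutoff = np.quantile([totals[k] for k in active], 1 - keep_frac)
            active = [k for k in active if totals[k] >= cutoff]
    return max(active, key=lambda k: totals[k])
```

Because every prefix of the reordered sequence already covers the utterance, weak speaker models can be discarded after scoring only a few blocks, which is where the computational saving comes from.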
Finally, the fourth contribution formulates two new speech spectrum modeling methods for trainable speech synthesis. The first algorithm models speaker-dependent sequences of Line Spectral Frequencies (LSFs) using HMMs. The second algorithm is based on trajectory modeling of LSFs. A state-tied excitation is integrated into the model to further convey speaker-dependent voice characteristics. The synthesis algorithms are shown to provide difficult impostor tests for an automatic GMM-based speaker verifier. Contributions in these four related areas have resulted in new directions toward achieving robust speaker recognition.
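For context on the LSF parameterization used by both synthesis models, the sketch below shows the standard LPC-to-LSF conversion via the symmetric sum and antisymmetric difference polynomials, with roots found numerically; this is textbook material, not the thesis's own implementation.

```python
import numpy as np

def lpc_to_lsf(a):
    """Convert LPC coefficients a = [1, a1, ..., ap] to sorted LSFs in (0, pi)."""
    a = np.asarray(a, dtype=float)
    a_pad = np.concatenate([a, [0.0]])
    p_poly = a_pad + a_pad[::-1]   # sum polynomial P(z) = A(z) + z^-(p+1) A(1/z)
    q_poly = a_pad - a_pad[::-1]   # difference polynomial Q(z)
    lsfs = []
    for poly in (p_poly, q_poly):
        # For minimum-phase A(z), roots of P and Q lie on the unit circle;
        # keep one root per conjugate pair (positive imaginary part), which
        # also discards the trivial roots at z = +/-1.
        roots = np.roots(poly)
        lsfs.extend(np.angle(r) for r in roots if r.imag > 1e-9)
    return np.sort(lsfs)
```

The p resulting angles interleave between the two polynomials and are bounded in (0, π), which is what makes LSF sequences well suited to statistical modeling and smoothing in HMM or trajectory frameworks.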