Speech and speaker recognition systems degrade rapidly in the presence of mismatch between training and testing conditions. In a real-world scenario, speech is usually recorded through telephone handsets having diverse characteristics, and then transmitted over commercial telephone lines. Varying transmission channels and telephone handsets introduce convolutional distortion resulting in mismatch between training and testing conditions. The objective of this dissertation has been to develop techniques to estimate the channel with which a speech signal has been corrupted and use the estimated channel for normalization, thereby improving system performance.
Conventional approaches to channel normalization, such as Cepstral Mean Normalization and Periodogram Averaging, are based on averaging the magnitude spectrum. However, averaging the magnitude spectrum leaves significant residual speech in the average, leading to a poor channel estimate, particularly of sharp cut-offs and steep roll-offs. This dissertation takes an alternative approach to channel estimation by employing the coherent, or complex, spectral average instead of the magnitude average, thereby preserving phase information during the averaging process. This significantly reduces the amount of residual speech in the average and results in a much better channel estimate.
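The magnitude-averaging idea behind the conventional approach can be illustrated with a minimal sketch. It assumes a white-noise stand-in for speech and a hypothetical three-tap FIR channel, neither of which comes from the dissertation, and it works on the log-magnitude spectrum; because the real cepstrum is a linear transform of the log-magnitude spectrum, subtracting the per-utterance mean in either domain is equivalent. The sketch shows only that mean subtraction removes most of a fixed convolutional channel; it does not implement Coherent Spectral Averaging.

```python
import numpy as np

def log_magnitude_frames(signal, frame_len=256, hop=128):
    """Return the log-magnitude spectrum of each Hann-windowed frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Small constant avoids log(0) in empty bins.
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)

def mean_normalize(log_spec):
    """Subtract the per-utterance mean over frames (the averaging step)."""
    return log_spec - log_spec.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)       # stand-in for speech (assumption)
channel = np.array([1.0, -0.6, 0.2])     # hypothetical FIR channel (assumption)
corrupted = np.convolve(clean, channel)[:len(clean)]

clean_log = log_magnitude_frames(clean)
corrupted_log = log_magnitude_frames(corrupted)

# Average spectral mismatch before and after mean normalization.
raw_gap = np.mean(np.abs(clean_log - corrupted_log))
norm_gap = np.mean(np.abs(mean_normalize(clean_log) - mean_normalize(corrupted_log)))
print(f"mismatch before: {raw_gap:.3f}, after: {norm_gap:.3f}")
```

Note that the subtracted mean contains the average speech spectrum as well as the channel, which is the residual-speech problem described above.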
The new channel estimation technique, Coherent Spectral Averaging, estimates the channel from the complex spectral average of the channel-corrupted utterance, converts the estimate into an inverse filter, and uses that filter for normalization. In addition to the basic Coherent Spectral Averaging technique, an improved version based on a reference utterance is presented. A refinement process that uses the reference utterance to further enhance the estimated channel was also developed.
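The final step, turning a channel estimate into an inverse filter, can be sketched as follows. This is a generic regularized inverse, not the dissertation's procedure: the channel response is assumed to be already known, the channel taps are hypothetical, and the channel is applied by circular convolution on a single frame so that the inversion is exact up to the regularization floor.

```python
import numpy as np

def inverse_filter_response(channel_response, floor=1e-3):
    """Regularized inverse of H: conj(H) / (|H|^2 + floor)."""
    h = np.asarray(channel_response, dtype=complex)
    return np.conj(h) / (np.abs(h) ** 2 + floor)

def normalize_frame(frame, channel_response, floor=1e-3):
    """Apply the inverse filter to one frame in the frequency domain."""
    spectrum = np.fft.rfft(frame)
    equalized = spectrum * inverse_filter_response(channel_response, floor)
    return np.fft.irfft(equalized, n=len(frame))

rng = np.random.default_rng(1)
frame = rng.standard_normal(512)              # stand-in for a speech frame
channel_taps = np.array([1.0, -0.6, 0.2])     # hypothetical channel (assumption)
H = np.fft.rfft(channel_taps, n=len(frame))   # channel frequency response

# Corrupt the frame by circular convolution with the channel.
corrupted = np.fft.irfft(np.fft.rfft(frame) * H, n=len(frame))
restored = normalize_frame(corrupted, H, floor=1e-6)

error_before = np.max(np.abs(corrupted - frame))
error_after = np.max(np.abs(restored - frame))
print(f"max error before: {error_before:.3f}, after: {error_after:.2e}")
```

The `floor` term keeps the inverse bounded where the channel response is close to zero, which matters for channels with sharp cut-offs; in practice the quality of the result depends on how well the channel was estimated.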
The Coherent Spectral Averaging technique was evaluated in speaker recognition and channel/handset identification experiments. These experiments were performed on speech corrupted with simulated channels, on real-world telephone data, and on data recorded through multiple telephone handsets. The Coherent Spectral Averaging technique, particularly with the reference utterance and the refinement process, yielded excellent channel estimates that closely tracked sharp cut-offs and steep roll-offs in the channel. The Coherent Spectral Averaging techniques also significantly improved speaker recognition performance on channel/handset-mismatched speech, most often performing better than Cepstral Mean Normalization and in many cases restoring performance to that obtained under channel-free conditions.