To achieve high-precision speech recognition in real environments, phone-model adaptation procedures are required that can rapidly accommodate a wide range of speakers and acoustic noise conditions. In this paper we propose an unsupervised speaker adaptation method that extends an unsupervised speaker and environment adaptation method based on HMM sufficient statistics by performing spectral subtraction and then adding a known noise to the input. Existing methods assume that a model is trained to match each type of background noise to be recognized, and they do not account for variations in the signal-to-noise ratio or for changes in the background noise of a given input. In contrast, our method suppresses the noise in the input data using an estimate of the noise spectrum and then adds a known, stable noise to the residual noise that remains, thereby smoothing out differences between background noises and enabling recognition with a single set of acoustic models. For speaker adaptation, we select the set of speakers closest to the test speaker from our database on the basis of a single arbitrary utterance, and we retrain the acoustic models using the sufficient statistics of those speakers. Combining these two methods allows us to adapt rapidly and accurately to a new speaker. In recognition experiments at a signal-to-noise ratio of 20 dB under a variety of noise conditions, the proposed method achieved a recognition rate about 2 percentage points higher than a speaker-independent model matched to each test noise environment, for an average recognition performance of 85.1 percent overall. We also compare our method with a standard supervised adaptation technique, maximum likelihood linear regression (MLLR). © 2005 Wiley Periodicals, Inc.
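The noise-normalization step described above can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: it assumes per-frame magnitude spectra as NumPy arrays, a Boll-style spectral subtraction with an over-subtraction factor and a spectral floor, and then superimposes a known, stable noise spectrum so that every input ends up in a single shared noise condition.

```python
import numpy as np

def subtract_then_add_known_noise(noisy, noise_est, known_noise,
                                  alpha=1.0, beta=0.01):
    """Hypothetical helper illustrating the two-stage noise handling:
      1. spectral subtraction of an estimated background-noise spectrum,
      2. addition of a known, stable noise spectrum to the residual.
    All arguments are nonnegative magnitude spectra (1-D arrays of equal
    length); `alpha` is the over-subtraction factor and `beta` sets the
    spectral floor that prevents negative magnitudes."""
    # Stage 1: Boll-style spectral subtraction with a spectral floor.
    cleaned = np.maximum(noisy - alpha * noise_est, beta * noisy)
    # Stage 2: superimpose the known noise, smoothing out differences
    # between the residual noises of different environments.
    return cleaned + known_noise
```

Because the same known noise is added to every processed input, a single set of acoustic models trained under that known-noise condition can be used for all environments.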
Electron Comm Jpn Pt 2, 88(8): 30–41, 2005; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ecjb.20199
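The speaker-adaptation side of the method can likewise be sketched. The snippet below is an illustrative simplification under stated assumptions, not the paper's implementation: each database speaker is summarized by a single diagonal Gaussian (`mu`, `var`) for scoring, speakers are ranked by the average log-likelihood of one test utterance, and the adapted mean is rebuilt from the selected speakers' precomputed sufficient statistics (occupancy count and feature sum) without touching raw speech.

```python
import numpy as np

def select_and_adapt(test_feats, speaker_models, speaker_stats, n_select=2):
    """Hypothetical sketch of sufficient-statistics speaker adaptation.
    test_feats:     (T, D) feature frames from a single test utterance.
    speaker_models: {speaker: (mu, var)} diagonal Gaussians for ranking.
    speaker_stats:  {speaker: (count, feat_sum)} precomputed statistics.
    Returns the adapted mean and the list of selected speakers."""
    # Rank speakers by average per-frame log-likelihood of the utterance.
    scores = {}
    for spk, (mu, var) in speaker_models.items():
        ll = -0.5 * np.mean((test_feats - mu) ** 2 / var
                            + np.log(2.0 * np.pi * var))
        scores[spk] = ll
    selected = sorted(scores, key=scores.get, reverse=True)[:n_select]
    # Combine the selected speakers' sufficient statistics; the adapted
    # mean is the pooled first-order statistic over the pooled count.
    count = sum(speaker_stats[s][0] for s in selected)
    feat_sum = sum(speaker_stats[s][1] for s in selected)
    return feat_sum / count, selected
```

Because only precomputed statistics are combined, adaptation needs just one arbitrary utterance from the test speaker and no transcription, which is what makes the scheme unsupervised and fast.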