Using Voice Transformations to Create Additional Training Talkers for Word Spotting

Speech recognizers provide good performance for most users, but the error rate often increases dramatically for a small percentage of talkers who are "different" from the talkers used for training. One expensive solution to this problem is to gather more training data in an attempt to sample these outlier users. A second solution, explored in this paper, is to artificially enlarge the number of training talkers by transforming the speech of existing training talkers. This approach is similar to enlarging the training set for OCR digit recognition by warping the training digit images, but is more difficult because continuous speech has a much larger number of dimensions (e.g., linguistic, phonetic, style, temporal, spectral) that differ across talkers. We explored the use of simple linear spectral warping to enlarge a 48-talker training database used for word spotting. The average detection rate overall was increased by 2.9 percentage points (from 68.3% to 71.2%) for male speakers and 2.5 percentage points (from 64.8% to 67.3%) for female speakers. This increase is small but similar to that obtained by doubling the amount of training data.
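The core idea of linear spectral warping is to rescale the frequency axis of each training utterance's spectrum, approximating a change in vocal-tract length and thereby simulating a new talker. The abstract does not specify the authors' exact implementation, so the sketch below is only a minimal illustration of the general technique: the function name, warp factors, and interpolation scheme are all assumptions.

```python
import numpy as np

def linear_spectral_warp(spectrum, alpha):
    """Warp the frequency axis of a magnitude spectrum by factor alpha.

    Hypothetical sketch: alpha > 1 compresses spectral features toward
    lower frequencies (longer simulated vocal tract); alpha < 1 stretches
    them upward. The warped spectrum keeps the original length.
    """
    n = len(spectrum)
    bins = np.arange(n, dtype=float)
    # Sample the original spectrum at warped frequency positions f / alpha;
    # np.interp clamps out-of-range positions to the edge values.
    return np.interp(bins / alpha, bins, spectrum)

# Example: generate two synthetic "new talkers" from one training spectrum.
rng = np.random.default_rng(0)
spectrum = rng.random(256)
warped_up = linear_spectral_warp(spectrum, 0.9)    # features shifted up
warped_down = linear_spectral_warp(spectrum, 1.1)  # features shifted down
```

In practice such warped spectra would be converted back to the recognizer's feature representation and pooled with the original 48 talkers' data; the warp factors used here are illustrative, not the paper's.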
