Improving wordspotting performance with limited training data

This thesis addresses the problem of limited training data in pattern detection problems where a small number of target classes must be detected in a varied background. In this type of spotting problem there is typically limited training data and limited knowledge about class distributions, so a statistical pattern classifier cannot accurately model the class distributions. The domain of wordspotting is used to explore new approaches that improve spotting system performance with limited training data. First, a high-performance, state-of-the-art whole-word based wordspotter is developed. Two complementary approaches are then introduced to help compensate for the lack of data. Figure of Merit training, a new type of discriminative training algorithm, modifies the spotting system parameters according to the metric used to evaluate wordspotting systems. The effectiveness of discriminative training approaches may be limited by overtraining a classifier on insufficient training data: while the classifier's performance on the training data improves, its performance on unseen test data degrades. To alleviate this problem, voice transformation techniques are used to generate more training examples that improve the robustness of the spotting system. The wordspotter is trained and tested on the Switchboard credit-card database, a database of spontaneous conversations recorded over the telephone. The baseline wordspotter achieves a Figure of Merit of 62.5% on a testing set. With Figure of Merit training, the Figure of Merit improves to 65.8%. When Figure of Merit training and voice transformations are used together, the Figure of Merit improves to 71.9%. The final wordspotter system achieves a Figure of Merit of 64.2% on the National Institute of Standards and Technology (NIST) September 1992 official benchmark, surpassing the 1992 results from other whole-word based wordspotting systems.

Thesis Co-Supervisor: Richard P. Lippmann
Title: Senior Technical Staff

Thesis Co-Supervisor: David H. Staelin
Title: Professor of Electrical Engineering
