On the application of embedded digit training to speaker independent connected digit recognition

In recent years, several algorithms have been proposed for recognizing a string of connected words (typically digits) by optimally piecing together reference patterns corresponding to the words in the string. Although the algorithms differ greatly in details of implementation, storage requirements, etc., they all have essentially the same performance, in that their ability to match the unknown string depends on how well words spoken in isolation match their counterparts in connected speech. For low rates of articulation (i.e., about 100-130 words per minute), the performance of such connected word recognition systems is excellent. However, as the articulation rate approaches that of continuous discourse (180-300 words per minute), their performance falls dramatically. To partially alleviate this problem, a modified training procedure was devised in which multiple versions of each reference word were used. The multiple versions included an isolated form of each word and two versions of the word extracted from the middle of three-word sequences. One of these embedded reference patterns represented a noncontextual token of the word (i.e., spoken in a format where the words on either side had minimal effect on the acoustic properties at the boundaries), and the second represented a highly contextual token of the word. It was shown that a training algorithm could be devised to obtain these embedded reference tokens and that, when the multiple reference patterns were used, the performance of a speaker-trained system was greatly improved at faster talking rates. In this paper we show how the embedded training technique can be extended to the case of speaker-independent connected word recognizers. In particular, we show that improved recognition performance on connected digit strings is obtained by using standard clustering procedures on the embedded tokens to give a speaker-independent embedded reference set. We also show that the use of the K-nearest neighbor (KNN) rule leads to additional real improvements in performance for recognizing strings of connected digits. A discussion of the types of problems that remain is given.
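
The abstract does not spell out how the KNN rule is applied. As a rough illustration only, the sketch below assumes the form commonly used in template-based recognizers: each word has several reference templates, and the word's score is the average of the K smallest template distances rather than the single smallest one. The function names, the example distances, and the choice of K are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def knn_score(distances: List[float], k: int) -> float:
    """Average of the k smallest distances to one word's reference templates.

    Assumption: this averaging form of the KNN rule is illustrative only.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not distances:
        raise ValueError("each word needs at least one template distance")
    nearest = sorted(distances)[:k]  # uses fewer than k if fewer are available
    return sum(nearest) / len(nearest)


def classify(distances_by_word: Dict[str, List[float]], k: int = 2) -> Tuple[str, float]:
    """Return the word with the lowest KNN score, together with that score."""
    scores = {word: knn_score(d, k) for word, d in distances_by_word.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical distances from one unknown token to three templates per word.
example = {
    "one": [0.30, 0.90, 0.95],   # one very close template, two poor ones
    "nine": [0.45, 0.50, 0.92],  # two consistently close templates
}

print(classify(example, k=1))  # nearest-neighbor rule picks "one"
print(classify(example, k=2))  # averaging the two best picks "nine"
```

With K = 1 a single unusually close template decides the result; with K = 2 a word must be supported by more than one template, which is the usual motivation for averaging over several nearest references.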
