A connected speech recognition system based on spotting diphone-like segments--Preliminary results

A template-based connected speech recognition system, which represents words as sequences of diphone-like segments, has been implemented and evaluated. The inventory of segments is divided into two principal classes: "steady-state" speech sounds such as vowels, fricatives, and nasals, and "composite" speech sounds consisting of sequences of two or more speech sounds in which the transitions from one sound to another are intrinsic to the representation of the composite sound. Templates representing these segments are extracted from labelled training utterances. Words are represented by network models whose branches are diphone segments. Word juncture phenomena are accommodated by including segment branches that characterize transition pronunciations between specified classes of words. A word is recognized in an utterance by "spotting" all the segments contained in the model of the word. Putative words and word combinations are found by searching for the best-scoring sequences of segments specified by the models, subject to segment separation constraints. A pruning procedure then finds the best-scoring string of words subject to constraints on word lengths, separations, and overlaps. An evaluation of the recognizer has been carried out on a database of connected digit utterances spoken by a single male talker. Templates are extracted from one half of the database, consisting of 2100 digit utterances, and system performance is tested on the remaining 2100 utterances. The performance obtained to date is approximately a 2% digit error rate and a 7 to 8% digit string error rate.
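The final pruning step described above can be illustrated with a small dynamic-programming sketch: given putative word hypotheses (each with a start frame, end frame, and score), it finds the best-scoring string of words subject to separation and overlap constraints. All names, parameter values (`max_gap`, `max_overlap`), and the higher-is-better scoring convention are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the word-string pruning step: a dynamic program
# over putative word hypotheses sorted by end frame. Scores are assumed
# to be higher-is-better; separation/overlap limits are assumed values.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Hypothesis:
    word: str
    start: int    # start frame of the putative word
    end: int      # end frame of the putative word
    score: float  # match score (higher = better, by assumption)

def best_word_string(hyps: List[Hypothesis],
                     max_gap: int = 5,
                     max_overlap: int = 2) -> Tuple[float, List[str]]:
    """Return the best cumulative score and its word string."""
    hyps = sorted(hyps, key=lambda h: h.end)
    n = len(hyps)
    best = [h.score for h in hyps]           # best score of a string ending at hyp i
    back: List[Optional[int]] = [None] * n   # backpointer to the previous word
    for i in range(n):
        for j in range(i):
            gap = hyps[i].start - hyps[j].end
            # enforce the separation/overlap constraint between adjacent words
            if -max_overlap <= gap <= max_gap:
                cand = best[j] + hyps[i].score
                if cand > best[i]:
                    best[i] = cand
                    back[i] = j
    i_best = max(range(n), key=lambda k: best[k])
    total = best[i_best]
    words: List[str] = []
    i: Optional[int] = i_best
    while i is not None:            # trace backpointers to recover the string
        words.append(hyps[i].word)
        i = back[i]
    return total, list(reversed(words))
```

For example, a long low-scoring hypothesis spanning the whole utterance loses to two shorter, compatible hypotheses whose scores sum higher; the backpointers then recover the winning digit string in order.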
