A statistical approach to metrics for word and syllable recognition

Time‐warping pattern‐comparison algorithms are widely used in speech recognition. Two words or syllables being compared are described by a series of time frames, each containing values of a set of acoustic parameters. After time alignment, the squared distance between the patterns is summed over the parameters within a frame and then across frames. The sum obtained is assumed to be proportional to the log probability of the two patterns having the same identity. This assumption is generally invalid, but it may be made substantially true by analyzing the variability between different examples of the same syllable and adjusting the metric accordingly. Variability is estimated both as a function of frame position within the syllable and as a function of the acoustic parameters. In the latter case, within‐ and between‐class covariance matrices can be estimated and standard linear discriminant analysis methods applied. This permits the combination of disparate acoustic parameters into a single distance measure. In particular, combining frame and frame‐difference parameters allows one to use time development information and to take inter‐frame correlations into account.
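
The idea of adjusting the metric can be illustrated with a minimal sketch in Python. It is not the authors' implementation, and it rests on several assumptions: the function names (`frame_distance`, `dtw_distance`, `pooled_inverse_variance`) are hypothetical, the time alignment is a basic symmetric dynamic time warp, and the within-class covariance is reduced to its diagonal, so each acoustic parameter is simply scaled by the inverse of its pooled within-class variance. Full linear discriminant analysis would use the complete within- and between-class covariance matrices instead.

```python
from typing import Dict, List, Sequence

Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame, inv_var: Sequence[float]) -> float:
    """Squared distance between two frames, each parameter weighted by 1/variance."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, inv_var))


def dtw_distance(p: List[Frame], q: List[Frame], inv_var: Sequence[float]) -> float:
    """Sum of weighted squared frame distances along the cheapest time alignment."""
    n, m = len(p), len(q)
    inf = float("inf")
    # cost[i][j] is the best accumulated distance aligning p[:i] with q[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(p[i - 1], q[j - 1], inv_var)
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def pooled_inverse_variance(frames_by_class: Dict[str, List[Frame]]) -> List[float]:
    """Inverse of the pooled within-class variance of each parameter.

    Assumption: every class has at least two example frames, taken at
    comparable positions after alignment. This is a diagonal simplification
    of the within-class covariance matrix.
    """
    dim = len(next(iter(frames_by_class.values()))[0])
    sum_sq = [0.0] * dim
    dof = 0
    for frames in frames_by_class.values():
        mean = [sum(f[k] for f in frames) / len(frames) for k in range(dim)]
        for f in frames:
            for k in range(dim):
                sum_sq[k] += (f[k] - mean[k]) ** 2
        dof += len(frames) - 1
    # Floor the sum of squares to avoid dividing by zero for constant parameters.
    return [dof / max(s, 1e-12) for s in sum_sq]
```

With weights estimated this way, a parameter that varies widely between examples of the same syllable contributes less to the summed distance than a stable one, which is what lets parameters on different scales share a single distance measure.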