Mean temporal distance: Predicting ASR error from temporal properties of speech signal

Extending previous work on predicting phoneme recognition error from unlabeled data corrupted by unpredictable factors, the current work investigates a simple but effective method of estimating ASR performance by computing a function M(Δt), which represents the mean distance between speech feature vectors, evaluated over a finite time interval, as a function of the temporal distance Δt between the vectors. It is shown that M(Δt) depends on the signal-to-noise ratio of the speech signal. Comparing the M(Δt) curves derived from the data used to train the classifier with those derived from test utterances allows the error on the test data to be predicted. Another interesting observation is that M(Δt) remains approximately constant once the temporal separation Δt exceeds a certain critical interval (about 200 ms), indicating the extent of coarticulation in speech sounds.
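As an illustration of how a curve such as M(Δt) might be computed, the following is a minimal sketch, not the implementation used in this work. It assumes that an utterance is given as a sequence of equal-length feature vectors, one per frame, that Δt is measured in frames, and that the distance between two vectors is Euclidean; the abstract does not specify the distance measure, so that choice is an assumption. The function names `mean_temporal_distance` and `curve_mismatch` are hypothetical, and the mean absolute difference used to compare two curves is only one possible choice.

```python
import math
from statistics import fmean


def mean_temporal_distance(features, max_lag):
    """Return [M(1), ..., M(max_lag)], with the lag measured in frames.

    features: sequence of equal-length feature vectors, one per frame.
    M(lag) is the mean distance over all frame pairs (t, t + lag) in the
    utterance. Euclidean distance is an assumption made for this sketch.
    """
    n_frames = len(features)
    if not 1 <= max_lag < n_frames:
        raise ValueError("max_lag must satisfy 1 <= max_lag < number of frames")
    curve = []
    for lag in range(1, max_lag + 1):
        distances = [
            math.dist(features[t], features[t + lag])
            for t in range(n_frames - lag)
        ]
        curve.append(fmean(distances))
    return curve


def curve_mismatch(m_train, m_test):
    """Hypothetical mismatch score: mean absolute difference of two curves."""
    if len(m_train) != len(m_test):
        raise ValueError("curves must cover the same lags")
    return fmean(abs(a - b) for a, b in zip(m_train, m_test))
```

Under these assumptions, a curve computed on the training data and one computed on a test utterance could be passed to `curve_mismatch`, with a larger score suggesting a greater mismatch between the two conditions. If the frame shift were 10 ms (a common value, but not stated here), the critical interval of about 200 ms would correspond to a lag of roughly 20 frames.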