Syllable-Length Acoustic Units in Large-Vocabulary Continuous Speech Recognition

Recent research on the TIMIT corpus suggests that longer-length acoustic units are better suited for modelling coarticulation and long-term temporal dependencies in speech than conventional context-dependent phone models. However, the impressive results achieved on TIMIT [1] have yet to be reproduced on other corpora, such as read speech from the Spoken Dutch Corpus. We analyse differences between the TIMIT and Spoken Dutch Corpus data in an attempt to better understand under which conditions the use of longer-length units can be expected to yield considerable improvements in recognition accuracy. We conclude that at least part of the improvements found with TIMIT can be explained by details of the experimental procedure, and that longer-length left-to-right HMMs that borrow their topology from a sequence of triphones are only able to capture part of the pronunciation variation present in speech.