Hierarchical set representations of speech

This work demonstrates that a unified hierarchy of non-sequential set representations of speech exists at different temporal scales. Such a hierarchy is possible because speech is distinguishable not only as an ordered temporal sequence of elements but also, to a high degree, in a non-temporal sense. This thesis has been evaluated at the acoustic, phonetic, and word levels, and it provides new insights into the speech recognition problem. The advantages of using set representations as the central framework are apparent at different levels of the speech hierarchy. Distance set representations enable compact acoustic models. Phonetic set indexing provides a very fast method of pre-fetching word lists. At the word level, word sets allow long-distance relations between words to be captured and give rise to the concept of utterance and dialogue triggers, which have been implemented in the derived Trigger and Adaptive Boosting (TAB) algorithm. Voting with unified codebooks for speaker identification is also presented. Experiments have been performed in various speech domains, including the TIMIT, Trains-93, and Trains-95 corpora. Set representations of speech have also been utilized in two novel applications: multi-modal integration and web browsing with speech. The use of loosely synchronized eye-fixation information to improve speech recognition is proposed and evaluated in the TRAINS domain. The novel concept of web-triggered word sets is introduced in NetSpeak, a World Wide Web (WWW) speech interface system, for improved HTML link access.
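
The abstract does not spell out how phonetic set indexing works, so the following is only an illustrative sketch of one possible reading, not the method used in this work. It assumes that each word is keyed by the unordered set of phones in its pronunciation, so that a candidate word list can be pre-fetched from a set of observed phones regardless of their order. The toy lexicon, the phone symbols, and the names `build_index` and `prefetch` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> phone sequence (ARPAbet-like symbols).
# These entries are illustrative only and are not taken from the thesis.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "tack": ["t", "ae", "k"],
    "act": ["ae", "k", "t"],
    "bat": ["b", "ae", "t"],
    "tab": ["t", "ae", "b"],
}


def build_index(lexicon):
    """Map each unordered phone set to the words whose pronunciation uses it."""
    index = defaultdict(list)
    for word, phones in lexicon.items():
        # frozenset discards order and repetition, giving a hashable set key.
        index[frozenset(phones)].append(word)
    return index


def prefetch(index, observed_phones):
    """Return the sorted candidate words for a set of observed phones."""
    return sorted(index.get(frozenset(observed_phones), []))


INDEX = build_index(LEXICON)

# The order of the observed phones does not matter for the lookup.
print(prefetch(INDEX, ["t", "k", "ae"]))  # ['act', 'cat', 'tack']
```

In this sketch the lookup is a single hash-table access, which is one way a set-based index could be fast. The trade-off is that words sharing the same phones in a different order collide, so a later sequential stage would still be needed to choose among the pre-fetched candidates.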