Paper information - Low-latency incremental speech transcription in the Synface project

This thesis presents work in the area of automatic speech recognition (ASR). It focuses on methods for increasing the efficiency of speech recognition systems and on techniques for efficiently representing different types of knowledge in the decoding process. Several decoding algorithms and recognition systems, aimed at various recognition tasks, have been developed in this work. The thesis presents the KTH large vocabulary speech recognition system, which was developed for online (live) recognition with large vocabularies and complex language models. The system uses weighted transducer theory to represent different knowledge sources efficiently, with the purpose of optimizing the recognition process. A search algorithm for efficient processing of hidden Markov models (HMMs) is presented. The algorithm is an alternative to the classical Viterbi algorithm for fast computation of shortest paths in HMMs. It is part of a larger decoding strategy aimed at reducing the overall computational complexity of ASR. In this approach, all HMM computations are completely decoupled from the rest of the decoding process, which enables the use of larger vocabularies and more complex language models without an increase in HMM-related computations. Ace is another speech recognition system developed within this work. It is a platform aimed at facilitating the development of speech recognizers and new decoding methods. A real-time system for low-latency online speech transcription is also presented. The system was developed within a project whose goal is to make it easier for hard-of-hearing people to use conventional telephony by providing speech-synchronized multimodal feedback. This work addresses several additional requirements implied by this special recognition task.
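For orientation, the classical Viterbi algorithm that the abstract names as the baseline can be viewed as a shortest-path search once probabilities are converted to negative log costs. The following is a minimal sketch of that baseline only; it is not the alternative search algorithm developed in the thesis, which the abstract does not describe in detail. The function names (`neg_log`, `viterbi`) and the two-state toy model are assumptions made purely for illustration.

```python
import math


def neg_log(p):
    """Convert a probability to a non-negative path cost (hypothetical helper)."""
    return -math.log(p) if p > 0 else math.inf


def viterbi(init_cost, trans_cost, emit_cost, observations):
    """Classical Viterbi as a shortest-path search over an HMM trellis.

    init_cost[s]      cost of starting in state s
    trans_cost[p][s]  cost of moving from state p to state s
    emit_cost[s][o]   cost of emitting observation o in state s
    Returns (total_cost, state_path) for the cheapest state sequence.
    """
    n_states = len(init_cost)
    states = range(n_states)
    # Cost of the best path ending in each state after the first observation.
    cost = [init_cost[s] + emit_cost[s][observations[0]] for s in states]
    backpointers = []
    for obs in observations[1:]:
        new_cost, pointers = [], []
        for s in states:
            # Cheapest predecessor for state s at this time step.
            best_prev = min(states, key=lambda p: cost[p] + trans_cost[p][s])
            new_cost.append(cost[best_prev] + trans_cost[best_prev][s] + emit_cost[s][obs])
            pointers.append(best_prev)
        cost = new_cost
        backpointers.append(pointers)
    # Trace the cheapest path backwards from the best final state.
    last = min(states, key=lambda s: cost[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return cost[last], path


# Toy two-state model (assumed values, not taken from the thesis).
init = [neg_log(0.6), neg_log(0.4)]
trans = [[neg_log(0.7), neg_log(0.3)],
         [neg_log(0.4), neg_log(0.6)]]
emit = [[neg_log(0.5), neg_log(0.4), neg_log(0.1)],
        [neg_log(0.1), neg_log(0.3), neg_log(0.6)]]

best_cost, best_path = viterbi(init, trans, emit, [0, 1, 2])
```

Each time step in this sketch costs on the order of the number of states squared, which gives a rough sense of why reducing or decoupling HMM computations matters for large-vocabulary decoding.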

Alexander Seward
