Paper information - A multi-FPGA 10x-real-time high-speed search engine for a 5000-word vocabulary speech recognizer

Today's best-quality speech recognition systems are implemented in software. These systems fully occupy the resources of a high-end server to deliver results at real-time speed: each hour of audio requires a significant fraction of an hour of computation for recognition. This is profoundly limiting for applications that require extreme recognition speed, for example high-volume tasks such as video indexing (e.g., YouTube) or high-speed tasks such as triage of homeland security intelligence. We describe the architecture and implementation of one critical component, the backend search stage, of a high-speed, large-vocabulary recognizer. Our implementation runs on a multi-FPGA Berkeley Emulation Engine 2 (BEE2) platform and handles a standard 5000-word Wall Street Journal speech benchmark. Running at 100 MHz, a clock rate roughly 30x slower than a conventional server, our backend search engine decodes on average 10x faster than real time with negligible degradation in accuracy. To the best of our knowledge, this is both the most complex and the fastest recognizer ever realized in hardware form.
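To make the abstract's figures concrete, the arithmetic can be sketched as below. The 10x speedup, the 100 MHz clock, and the ~30x clock ratio come from the abstract. The 10 ms frame length is an assumption (a common choice in speech front ends, not stated here), and the function names are hypothetical.

```python
def decode_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time needed to decode audio at a given speedup over real time."""
    return audio_seconds / speedup


def cycles_per_frame(clock_hz: float, speedup: float, frame_seconds: float) -> float:
    """Clock cycles available per audio frame at a given speedup over real time."""
    return clock_hz * frame_seconds / speedup


SPEEDUP = 10.0          # from the abstract: 10x faster than real time
CLOCK_HZ = 100e6        # from the abstract: 100 MHz
CLOCK_RATIO = 30.0      # from the abstract: server clock is ~30x faster
FRAME_SECONDS = 0.010   # ASSUMPTION: 10 ms frames, not stated in the abstract

one_hour = decode_seconds(3600.0, SPEEDUP)                   # seconds to decode 1 h of audio
budget = cycles_per_frame(CLOCK_HZ, SPEEDUP, FRAME_SECONDS)  # cycles per assumed frame
server_hz = CLOCK_HZ * CLOCK_RATIO                           # implied server clock rate

print(f"1 hour of audio decodes in about {one_hour / 60:.0f} minutes")
print(f"cycle budget per assumed 10 ms frame: {budget:.0f}")
print(f"implied server clock: about {server_hz / 1e9:.0f} GHz")
```

Under these assumptions, an hour of audio takes about six minutes to search, and the engine has on the order of 100,000 cycles per frame; a different frame length scales that budget proportionally.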

Rob A. Rutenbar | Edward C. Lin
