QCRI's Live Speech Translation System

In this work, we present the Qatar Computing Research Institute's live speech translation system. The system works with both Arabic and English. It is built with an array of modern web technologies to capture speech in real time and to transcribe and translate it using state-of-the-art Automatic Speech Recognition (ASR) and Machine Translation (MT) systems. The platform is designed to be useful in a wide variety of situations such as lectures, talks, and meetings. In the Middle East, it is often the case that audiences understand only Arabic or only English; our system enables the speaker to talk in either language while the audience follows what is being said even if they are not bilingual.

The system consists of three primary modules: i) a Web application, ii) an ASR system, and iii) a statistical/neural MT system. The three modules are optimized to work jointly and to process speech at a real-time factor (the ratio of processing time to speech duration) close to one, meaning that the pipeline keeps up with the speaker and provides results with a short delay, comparable to what we observe in (human) interpretation. The real-time factor for the entire pipeline is 1.18.

The Web application is based on the standard HTML5 WebAudio application programming interface. It captures speech input from a microphone on the user's device and transmits it to the backend servers for processing; the servers send back the transcriptions and translations of the speech, which are then displayed to the user (a minimal browser-side sketch of this loop is given below). The platform can also broadcast live sessions instantly, so that anyone can follow the transcriptions and translations of a session in real time without being physically present at the speaker's location.

The ASR system is based on Kaldi, a state-of-the-art toolkit for speech recognition. We use a combination of time-delay neural networks (TDNN) and long short-term memory (LSTM) networks to ensure real-time transcription of the incoming speech while maintaining high-quality output. The Arabic and English systems have average word error rates of 23% and 9.7%, respectively. The Arabic system consists of the following components: i) a character-based lexicon of size 900K, which maps words to sound units to learn acoustic representations; ii) 40-dimensional high-resolution features extracted for each speech frame to digitize the audio signal; iii) 100-dimensional i-vectors for each frame to facilitate speaker adaptation; iv) TDNN acoustic models; and v) a tri-gram language model trained on 110M words and restricted to a 900K vocabulary.

The MT system offers two backend choices: a statistical phrase-based system and a neural MT system. The phrase-based system is trained with Moses, a state-of-the-art statistical MT framework, and the neural system is trained with Nematus, a state-of-the-art neural MT framework. We use Modified Moore-Lewis filtering to select the best subset of the available data, so that the phrase-based system can be trained more efficiently. To speed up translation further, we prune the language models backing the phrase-based system, discarding knowledge that is rarely used. The neural MT system, by contrast, is trained on all the available data, since its training scales linearly with the amount of data, unlike phrase-based systems. Our neural MT system is roughly 3–5% better in BLEU, a standard measure of translation quality.
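The browser-side capture loop mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical WebSocket endpoint (wss://st.example.org/asr) and a backend that accepts 16 kHz, 16-bit PCM chunks and streams text results back; the actual endpoint and wire protocol of our deployment are not described here.

```typescript
// Minimal sketch: capture microphone audio with the WebAudio API and stream
// it to a (hypothetical) transcription/translation backend over a WebSocket.
async function streamMicrophone(): Promise<void> {
  const socket = new WebSocket("wss://st.example.org/asr"); // placeholder URL
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const context = new AudioContext({ sampleRate: 16000 });   // 16 kHz capture
  const source = context.createMediaStreamSource(stream);
  const processor = context.createScriptProcessor(4096, 1, 1);

  processor.onaudioprocess = (event: AudioProcessingEvent) => {
    // Convert Float32 samples in [-1, 1] to 16-bit signed PCM.
    const samples = event.inputBuffer.getChannelData(0);
    const pcm = new Int16Array(samples.length);
    for (let i = 0; i < samples.length; i++) {
      pcm[i] = Math.max(-32768, Math.min(32767, samples[i] * 32768));
    }
    if (socket.readyState === WebSocket.OPEN) {
      socket.send(pcm.buffer); // stream raw PCM chunks to the backend
    }
  };

  source.connect(processor);
  processor.connect(context.destination);

  // Display transcriptions/translations as the server sends them back.
  socket.onmessage = (msg: MessageEvent) => {
    console.log(msg.data);
  };
}
```

In production one would likely replace the deprecated ScriptProcessorNode with an AudioWorklet, but the capture-and-stream pattern remains the same.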
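The data selection step for the phrase-based system can also be sketched. Modified Moore-Lewis filtering ranks candidate sentences by the difference in per-word cross-entropy between a language model trained on in-domain data and one trained on the general pool, keeping the lowest-scoring (most in-domain-like) portion; for parallel data the modified variant sums this difference over source and target sides. The scorer functions and the keep fraction below are illustrative placeholders, not our actual implementation.

```typescript
// Sketch of (Modified) Moore-Lewis data selection. Scorers are assumed to
// return per-word cross-entropy under a trained language model.
type Scorer = (sentence: string) => number;

function mooreLewisSelect(
  sentences: string[],
  inDomainScore: Scorer, // LM trained on in-domain data
  generalScore: Scorer,  // LM trained on the general pool
  keepFraction: number   // e.g. 0.2 keeps the best-ranked 20%
): string[] {
  // A lower (in-domain minus general) cross-entropy difference means the
  // sentence looks more like the in-domain data relative to the pool.
  const ranked = sentences
    .map((s) => ({ s, score: inDomainScore(s) - generalScore(s) }))
    .sort((a, b) => a.score - b.score);
  const keep = Math.floor(ranked.length * keepFraction);
  return ranked.slice(0, keep).map((r) => r.s);
}
```

The selected subset is then used to train the phrase-based models, which keeps them compact enough for efficient training and decoding.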
However, existing neural MT decoders are slower than phrase-based decoders, translating 9.5 tokens/second versus 24 tokens/second; for a 20-token sentence, this amounts to roughly 2.1 seconds versus 0.8 seconds. This trade-off between efficiency and accuracy barred us from picking a single final system. By enabling both technologies, we expose the trade-off between quality and efficiency and leave it to the user to decide whether they prefer a fast or an accurate system.

Our system has been successfully demonstrated locally and globally at several venues, including Al Jazeera, MIT, BBC, and TII. The state-of-the-art transcription and translation technologies backing the platform are also available independently and can be integrated seamlessly into any external platform. The speech translation system is publicly available at http://st.qcri.org/demos/livetranslation.