Automatic steering of microphone array and video camera toward multi-lingual tele-conference through speech-to-speech translation

Capturing distant-talking speech with high quality is essential for multi-lingual tele-conferencing through speech-to-speech translation. The speaker's image is also needed for natural communication in such a conference. A microphone array is an effective means of capturing distant-talking speech: steering the array and a video camera in the speaker's direction enhances the uttered speech and captures the speaker's image. Automatic steering, however, requires localizing the talker. To address this problem, we propose a system that automatically steers the microphone array and the video camera for multi-lingual tele-conferencing through speech-to-speech translation. The system uses the CSP coefficient addition method for speaker localization and ATR-MATRIX for speech-to-speech translation. In experiments in a real room, with the speaker located 2 meters from the microphone array, the direction of arrival (DOA) estimation rate (i.e., the speaker image capturing rate) was 97.7%, the speech recognition rate was 90.0%, and the TOEIC score was 530 to 540 points. We also confirmed that the translated speech and the speaker image can be presented immediately after the microphone array and the video camera are accurately steered toward the speaker and the speech beamformed by the microphone array is translated in real time.
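
The localization step can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes that CSP denotes the cross-power spectrum phase (whitened cross-correlation) between two microphone signals, and that "coefficient addition" means summing the CSP coefficients of several microphone pairs for each candidate direction before picking the peak. The linear array geometry, far-field propagation, sound speed of 340 m/s, sampling rate, and all function names below are assumptions chosen for illustration only.

```python
from itertools import combinations

import numpy as np


def csp(x1, x2):
    """CSP coefficients of two equal-length signals.

    Index k holds the coefficient for circular lag k, where a peak at
    lag k means x1 arrives k samples later than x2.
    """
    n = len(x1)
    X1 = np.fft.rfft(x1, 2 * n)
    X2 = np.fft.rfft(x2, 2 * n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12  # keep only the phase (whitening)
    return np.fft.irfft(cross, 2 * n)


def estimate_doa(signals, mic_x, fs, c=340.0, angles_deg=None):
    """Sum CSP coefficients over all microphone pairs and return the
    candidate angle (degrees from the array axis) with the largest sum."""
    if angles_deg is None:
        angles_deg = np.arange(0, 181)
    cos_t = np.cos(np.deg2rad(angles_deg))
    total = np.zeros(len(angles_deg))
    for i, j in combinations(range(len(mic_x)), 2):
        coeffs = csp(signals[i], signals[j])
        # Far-field lag (in samples) expected for each candidate angle.
        lags = np.rint(fs * (mic_x[j] - mic_x[i]) * cos_t / c).astype(int)
        total += coeffs[lags % len(coeffs)]
    return int(angles_deg[np.argmax(total)])


# Synthetic check (hypothetical setup): white noise arriving from 60 degrees
# at a 4-microphone linear array with 8.5 cm spacing, sampled at 16 kHz.
fs, c, true_angle = 16000, 340.0, 60.0
mic_x = np.array([0.0, 0.085, 0.17, 0.255])
rng = np.random.default_rng(0)
source = rng.standard_normal(4096)
arrival = -np.rint(fs * mic_x * np.cos(np.deg2rad(true_angle)) / c).astype(int)
signals = np.stack([np.roll(source, int(t)) for t in arrival])

estimate = estimate_doa(signals, mic_x, fs, c)
print("estimated direction (degrees):", estimate)
```

Because lags are rounded to whole samples, several neighbouring angles map to the same lags, so the estimate is only resolved to within a few degrees in this setup. A practical system would also have to handle reverberation, noise, and frame-by-frame processing, none of which is modelled here.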