Capturing distant-talking speech with high quality is essential for multilingual tele-conferencing through speech-to-speech translation. The speaker image is also needed for natural communication in a multilingual tele-conference. A microphone array is an effective means of capturing distant-talking speech: by steering a microphone array and a video camera toward the speaker, the uttered speech can be enhanced and the speaker image can be captured. Steering the microphone array and the video camera automatically, however, requires localizing the talker. To address this problem, we propose a multilingual tele-conferencing system through speech-to-speech translation that automatically steers the microphone array and the video camera toward the speaker. To realize the proposed system, we use the CSP coefficient addition method for speaker localization and ATR-MATRIX for speech-to-speech translation. We conducted experiments in a real room. With the speaker located 2 meters from the microphone array, the Direction of Arrival (DOA) estimation rate (i.e., the speaker image capturing rate) was 97.7%, the speech recognition rate was 90.0%, and the TOEIC score was 530 to 540 points. We also confirmed that the translated speech and the speaker image can be presented immediately after the microphone array and the video camera are accurately steered toward the speaker and the speech beamformed by the microphone array is translated in real time.
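As a rough illustration of the localization step, the following is a minimal sketch of CSP-based DOA estimation in which the CSP coefficients of several microphone pairs are added before the peak is picked. It is not the implementation used in the system described here. It assumes a uniform linear array, a far-field source, a sound speed of 340 m/s, and addition over adjacent pairs only; the function names `csp_coefficients` and `estimate_doa` and all parameters are hypothetical.

```python
import numpy as np


def csp_coefficients(x1, x2):
    """CSP (phase-only cross-correlation) coefficients for one microphone pair.

    The returned array is indexed by lag in samples; negative lags wrap
    around to the end of the array.
    """
    n = len(x1) + len(x2)                       # zero-pad to avoid circular overlap
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.maximum(np.abs(cross), 1e-12)   # keep the phase, discard the magnitude
    return np.fft.irfft(cross, n)


def estimate_doa(signals, spacing, fs, c=340.0):
    """Estimate the DOA in degrees from broadside (assumed geometry).

    signals: array of shape (num_mics, num_samples) from a uniform linear array
    spacing: distance between adjacent microphones in meters
    fs:      sampling rate in Hz
    c:       assumed speed of sound in m/s
    """
    # Add the CSP coefficients of all adjacent pairs; these pairs share the
    # same spacing, so their peaks fall on the same lag.
    summed = sum(
        csp_coefficients(signals[i], signals[i + 1])
        for i in range(len(signals) - 1)
    )
    # Search only the lags that are physically possible for this spacing.
    max_lag = int(np.ceil(spacing / c * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    best_lag = lags[np.argmax(summed[lags])]
    # With this sign convention, a peak at lag -D means that each microphone
    # receives the signal D samples later than the previous one.
    delay = -best_lag / fs
    sin_theta = np.clip(c * delay / spacing, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```

Adding the coefficients across pairs before peak picking is intended to reinforce the peak that the pairs share while spurious peaks in individual pairs average out. The angle returned by such a sketch could then serve as the steering direction for both the beamformer and the camera.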