A multi-stream audio-video large-vocabulary Mandarin Chinese speech database

We present the acquisition and content of a multi-stream audio-visual large-vocabulary database in Mandarin Chinese. The database consists of 17,000 utterances spoken by 225 people and captured by a set of seven cameras and 12 microphones. We also provide label files that mark the endpoints of the utterances and script files that record the actual pronunciation of the speech. The database can be used for audio-visual speech recognition (AVSR) on both large-vocabulary and small-vocabulary tasks, microphone-array-based speech recognition, audio-visual speaker identification, and 3D face modeling.
