A multi-channel/multi-speaker interactive 3D audio-visual speech corpus in Mandarin

This paper presents a multi-channel/multi-speaker 3D audio-visual corpus for Mandarin continuous speech recognition and related fields such as speech visualization and speech synthesis. The corpus contains about 18k utterances from 24 speakers, totaling about 20 hours of data. For each utterance, the audio streams were recorded by two professional microphones, one in the near field and one in the far field, while a marker-based 3D facial motion capture system with six infrared cameras acquired the 3D video streams. In addition, corresponding 2D video streams were captured by a supplementary camera. The paper describes a data processing pipeline for synchronizing the audio and video streams, detecting and correcting outliers, and removing head motion introduced during recording, and discusses the results of this processing. To date, this corpus is the largest 3D audio-visual corpus for Mandarin.
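The head-motion removal step mentioned above is commonly implemented for marker-based 3D capture by rigidly aligning each frame of markers to a reference frame, e.g. with the Kabsch algorithm. The sketch below is illustrative only, not the authors' actual pipeline: the function name and the choice of a single static reference frame are assumptions.

```python
import numpy as np

def remove_rigid_motion(frame, reference):
    """Illustrative sketch: align one 3D marker frame (N x 3) to a
    reference frame (N x 3) by removing rigid motion (rotation and
    translation) via the Kabsch algorithm. The residual, non-rigid
    displacement is the articulatory/facial motion of interest."""
    # Center both point sets on their centroids (removes translation)
    ref_mean = reference.mean(axis=0)
    ref_c = reference - ref_mean
    frm_c = frame - frame.mean(axis=0)
    # Optimal rotation from the SVD of the cross-covariance matrix
    H = frm_c.T @ ref_c
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate the centered frame, then translate onto the reference centroid
    return frm_c @ R.T + ref_mean
```

Applied frame by frame (typically using only markers on rigid parts of the head, such as the forehead and nose bridge, to estimate the transform), this leaves lip and jaw motion expressed in a head-stabilized coordinate system.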
