Multimodal English corpus for automatic speech recognition

A multimodal corpus developed for research on speech recognition based on audio-visual data is presented. Besides the usual video and sound excerpts, the database also contains thermovision images and depth maps. All streams were recorded simultaneously; the corpus therefore makes it possible to examine how much information each modality contributes. Based on the recordings, it is also possible to develop a speech recognition system that analyzes several modalities at the same time. The paper describes the process of collecting the multimodal material and the post-processing procedure applied to it. Parameterization methods for the signals belonging to the different modalities are also proposed.
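Since the abstract does not specify how the simultaneously recorded streams are combined, the following is only a minimal sketch of one possible approach: nearest-neighbour alignment of modalities captured at different frame rates onto a common time base before fusion. All frame rates, array shapes, and the `align_to_reference` helper are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the paper's implementation) of aligning simultaneously
# recorded streams with different frame rates to a common time base.
import numpy as np

def align_to_reference(ref_times, stream_times, stream_frames):
    """For each reference timestamp, pick the temporally closest frame
    of another modality (nearest-neighbour alignment)."""
    idx = np.searchsorted(stream_times, ref_times)
    idx = np.clip(idx, 1, len(stream_times) - 1)
    left, right = stream_times[idx - 1], stream_times[idx]
    idx -= (ref_times - left) < (right - ref_times)   # step back if the left neighbour is closer
    return stream_frames[idx]

# Hypothetical streams: audio feature frames at 100 Hz (reference),
# video at 25 fps, thermal images at 30 fps; feature dimensions are arbitrary.
t_audio = np.arange(0, 5, 1 / 100)
t_video, video_feats = np.arange(0, 5, 1 / 25), np.random.rand(125, 64)
t_thermo, thermo_feats = np.arange(0, 5, 1 / 30), np.random.rand(150, 32)

video_aligned = align_to_reference(t_audio, t_video, video_feats)
thermo_aligned = align_to_reference(t_audio, t_thermo, thermo_feats)
fused = np.hstack([video_aligned, thermo_aligned])    # one joint feature vector per audio frame
```

A fused vector of this kind could then feed a conventional recognizer, while dropping one modality's columns allows its contribution to be assessed separately.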
