AVICAR: audio-visual speech corpus in a car environment

We describe a large audio-visual speech corpus recorded in a car environment, as well as the equipment and procedures used to build it. Data are collected through a multi-sensor array consisting of eight microphones on the sun visor and four video cameras on the dashboard. The script for the corpus consists of four categories: isolated digits, isolated letters, phone numbers, and sentences, all in English. The corpus includes 100 speakers (50 male, 50 female) from various language backgrounds. In order to vary the signal-to-noise ratio, each script is recorded under five noise conditions: engine idling, driving at 35 mph with windows open and with windows closed, and driving at 55 mph with windows open and with windows closed. The corpus is available at <http://www.ifp.uiuc.edu/speech/AVICAR/>.
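To make the recording matrix concrete, the sketch below enumerates the sessions implied by the abstract: 100 speakers, four script categories, and five noise conditions, each captured through eight audio and four video channels. The identifiers are illustrative assumptions, not the corpus's actual file-naming scheme.

```python
from itertools import product

# Recording matrix as described in the abstract. Names below are
# illustrative placeholders, not AVICAR's actual labels.
SCRIPT_CATEGORIES = ["isolated_digits", "isolated_letters",
                     "phone_numbers", "sentences"]
NOISE_CONDITIONS = [
    "idle",                 # engine idling
    "35mph_windows_open",
    "35mph_windows_closed",
    "55mph_windows_open",
    "55mph_windows_closed",
]
NUM_SPEAKERS = 100          # 50 male + 50 female
NUM_MICS = 8                # sun-visor microphone array
NUM_CAMERAS = 4             # dashboard camera array


def recording_sessions():
    """Yield one (speaker, category, condition) tuple per session."""
    yield from product(range(NUM_SPEAKERS), SCRIPT_CATEGORIES,
                       NOISE_CONDITIONS)


if __name__ == "__main__":
    total = sum(1 for _ in recording_sessions())
    print(f"{total} sessions, each with {NUM_MICS} audio channels "
          f"and {NUM_CAMERAS} video channels")
```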
