Paper information - AVICAR: audio-visual speech corpus in a car environment

AVICAR: audio-visual speech corpus in a car environment

We describe a large audio-visual speech corpus recorded in a car environment, as well as the equipment and procedures used to build it. Data are collected through a multi-sensory array consisting of eight microphones on the sun visor and four video cameras on the dashboard. The script for the corpus consists of four categories: isolated digits, isolated letters, phone numbers, and sentences, all in English. The 100 speakers, 50 male and 50 female, come from various language backgrounds. To vary the signal-to-noise ratio, each script is recorded under five noise conditions: idling, driving at 35 mph with windows open and closed, and driving at 55 mph with windows open and closed. The corpus is available through <http://www.ifp.uiuc.edu/speech/AVICAR/>.
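The recording design in the abstract can be sketched as a small enumeration. This is only an illustrative sketch: the counts come from the abstract, but the names (`NoiseCondition`, `SCRIPT_CATEGORIES`, `recording_conditions`) and the label strings are hypothetical and do not reflect the corpus's actual file naming or metadata format. The window state while idling is not stated in the abstract, so it is left as `None`.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional

# Counts taken from the abstract.
NUM_MICROPHONES = 8          # microphone array on the sun visor
NUM_CAMERAS = 4              # video cameras on the dashboard
NUM_SPEAKERS = 50 + 50       # 50 male and 50 female

# Hypothetical labels for the four script categories.
SCRIPT_CATEGORIES = (
    "isolated_digits",
    "isolated_letters",
    "phone_numbers",
    "sentences",
)


@dataclass(frozen=True)
class NoiseCondition:
    speed_mph: int                  # 0 means the car is idling
    windows_open: Optional[bool]    # None: not specified for idling


# Idling, plus two speeds times two window states, gives five conditions.
NOISE_CONDITIONS = [NoiseCondition(0, None)] + [
    NoiseCondition(speed, windows_open)
    for speed, windows_open in product((35, 55), (True, False))
]


def recording_conditions():
    """Return every (script category, noise condition) pair for one speaker."""
    return list(product(SCRIPT_CATEGORIES, NOISE_CONDITIONS))


if __name__ == "__main__":
    for category, condition in recording_conditions():
        print(category, condition)
```

Under these assumptions, each speaker is associated with 4 script categories times 5 noise conditions, or 20 category-condition pairs, each captured by all eight microphones and four cameras.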
