Paper information - The ICSI Meeting Corpus

The ICSI Meeting Corpus

We have collected a corpus of data from natural meetings that occurred at the International Computer Science Institute (ICSI) in Berkeley, California over the last three years. The corpus contains audio recorded simultaneously from head-worn and table-top microphones, word-level transcripts of meetings, and various metadata on participants, meetings, and hardware. Such a corpus supports work in automatic speech recognition, noise robustness, dialog modeling, prosody, rich transcription, information retrieval, and more. We present details on the contents of the corpus, as well as rationales for the decisions that led to its configuration. The corpus was delivered to the Linguistic Data Consortium (LDC).
