New telephone speech corpora at CSLU

The Center for Spoken Language Understanding (CSLU) collects, annotates and distributes telephone speech data to enable research in spoken language understanding and automatic language identification. This paper gives a brief overview of recent activities in pursuit of this mission. We summarize corpus development activities at CSLU and describe new corpora useful for research on specific tasks: alphabet recognition, numbers recognition, large vocabulary word recognition, and yes/no recognition. We then discuss our two newest data collection efforts, Cellular Speech and the 22-Language Telephone Speech Corpus. All CSLU corpora are available at no charge to academic institutions.

Corpus development activities at CSLU include:
(a) Protocol development
(b) Data collection
(c) Development of tools
(d) Transcription
(e) Convention development and documentation
(f) Reliability studies
(g) Distribution of data

Protocol Development
One of the first steps in any data collection is the development of a protocol that will elicit responses appropriate for the kind of system one is planning to build. We design our protocols in a variety of ways, but we maintain a focus on continuous or "natural" telephone speech. In addition to our continuous speech corpora, we have various large corpora containing repeated or isolated words and phrases, spoken letters, names and numbers. See Table 1 for a list of all speech corpora at CSLU.

Data Collection
Once the protocol is determined, it is necessary to create an automatic system to answer the telephone and record each caller's responses. Our system is accessible via a toll-free number throughout the United States. This not only increases our subject base, but also decreases possible dialectal bias. In addition to English data collections, we are collecting speech in 21 other languages as part of our 22-Language Corpus.

Transcribing Speech
We generally have a staff of 5-10 trained transcribers who label at various levels.
We produce transcriptions aligned to the waveform at both the word and phonetic levels, as well as word transcriptions not aligned to the waveform. All transcriptions explicitly capture any extraneous noise in the signal, such as breath noise or background speech. Pauses, or periods of relative silence in the signal, are also marked.

Convention Development
CSLU develops and documents all transcription conventions used in the transcribed speech corpora. For complete coverage of current CSLU labeling conventions, contact Terri Lander at tlander@cse.ogi.edu, or see [1] and [2].

Labeler Reliability
We periodically run experiments measuring inter-labeler reliability. This provides …