CASIA-CASSIL: a Chinese Telephone Conversation Corpus in Real Scenarios with Multi-leveled Annotation

CASIA-CASSIL is a large-scale corpus base of Chinese human-human naturally-occurring telephone conversations in restricted domains. The first edition consists of 792 90-second conversations belonging to tourism domain, which are selected from 7,639 spontaneous telephone recordings in real scenarios. The corpus is now being annotated with wide range of linguistic and paralinguistic information in multi-levels. The annotations include Turns, Speaker Gender, Orthographic Transcription, Chinese Syllable, Chinese Phonetic Transcription, Prosodic Boundary, Stress of Sentence, Non-Speech Sounds, Voice Quality, Topic, Dialog-act and Adjacency Pairs, Ill-formedness, and Expressive Emotion as well, 13 levels in total. The abundant annotation will be effective especially for studying Chinese spoken language phenomena. This paper describes the whole process to build the conversation corpus, including collecting and selecting the original data, and the follow-up process such as transcribing, annotating, and so on. CASIA-CASSIL is being extended to a large scale corpus base of annotated Chinese dialogs for spoken Chinese study.

[1]  Aijun Li,et al.  CORPUS DESIGN AND ANNOTATION FOR SPEECH SYNTHESIS AND RECOGNITION , 2006 .

[2]  Jean Carletta,et al.  The AMI Meeting Corpus: A Pre-announcement , 2005, MLMI.

[3]  William J. Byrne,et al.  CASS: a phonetically transcribed corpus of mandarin spontaneous speech , 2000, INTERSPEECH.

[4]  Li Aijun,et al.  CHINESE PROSODY AND PROSODIC LABELING OF SPONTANEOUS SPEECH , 2002 .

[5]  Elizabeth Shriberg,et al.  Meeting Recorder Project: Dialog Act Labeling Guide , 2004 .

[6]  Andreas Stolcke,et al.  The ICSI Meeting Corpus , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..