A Very Large Scale Mandarin Chinese Broadcast Collection for the GALE Program

In this paper, we present the design, collection, transcription and analysis of a Mandarin Chinese broadcast collection of over 3000 hours. The data was collected by the Hong Kong University of Science and Technology (HKUST) in China on a cable TV and satellite transmission platform established in support of the DARPA Global Autonomous Language Exploitation (GALE) program. The collection comprises broadcast news (BN) and broadcast conversation (BC), the latter including talk shows, roundtable discussions, call-in shows, editorials and other conversational programs that focus on news and current events. HKUST also collects detailed information about all recorded programs. A subset of the BC and BN recordings is manually transcribed in standard Chinese characters with UTF-8 encoding, using specific mark-ups for a small set of spontaneous and conversational speech phenomena. The collection is among the largest of its kind, and among the first, for Mandarin Chinese broadcast speech. It provides abundant and diverse samples for Mandarin speech recognition and for other application-dependent tasks, such as spontaneous speech processing and recognition, topic detection, information retrieval, and speaker recognition. HKUST's acoustic analysis of 500 hours of the speech and transcripts demonstrates the positive impact this data could have on system performance.
