Paper Information - Utilizing Crowdsourcing for the Construction of Chinese-Mongolian Speech Corpus with Evaluation Mechanism

Many natural language processing groups have recently adopted crowdsourcing as an alternative to costly traditional annotation. In this paper, we explore the use of the WeChat Official Account Platform (WOAP) to build a speech corpus, and we assess the feasibility of using WOAP followers (also known as contributors) to assemble a Mongolian speech corpus. A Mongolian language qualification test was used to filter out potentially unqualified participants. We gathered natural speech recordings from daily life and constructed a Chinese-Mongolian Speech Corpus (CMSC) of 31,472 utterances from 296 native speakers who are fluent in Mongolian, totalling 30.8 h of speech. We then performed an evaluation experiment in which contributors were asked to choose the correct sentence from a multiple-choice list, to ensure the high quality of the corpus. The results obtained so far show that crowdsourcing the construction of CMSC with an evaluation mechanism can be more effective than traditional experiments that require expertise.

Meng Zhao | Heyan Huang | Shumin Shi | Rihai Su
