The CorCenCC project (Corpws Cenedlaethol Cymraeg Cyfoes, or the National Corpus of Contemporary Welsh; www.corcencc.org) aims to assemble a 10-million-word corpus of the Welsh language across a range of contemporary contexts from spoken, written and e-language sources. In keeping with its contemporary focus, a key innovation of the project is crowdsourcing contributions to the corpus, giving Welsh speakers the opportunity to take part directly in its creation. This is of vital importance in the Welsh context, where community pride is strong. An open linguistic resource that properly represents the constantly evolving landscape of contemporary Welsh speakers, and the way their language is used, is expected to have a wide-reaching impact on how publishers, policy-makers, the education sector, academic researchers and many others work with Welsh. This presentation introduces the CorCenCC Crowdsourcing App, a mobile application designed to facilitate direct contribution of spoken language data to the corpus. Spoken language data will comprise 4 million words of the 10-million-word corpus (alongside 4 million words of written data and 2 million words of electronic language such as blogs and emails), and app users can contribute directly to this total by recording their Welsh-language narratives (Figures 1 and 2), attaching and editing appropriate metadata to describe the recorded conversations, and uploading them for inclusion in the final corpus. The metadata attached to each recording includes details about where the recording was made and who else was involved in it, together with tags that future corpus tools will be able to use to search the data in the final corpus.