Developing computational infrastructure for the CorCenCC corpus: The National Corpus of Contemporary Welsh

CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes, the National Corpus of Contemporary Welsh) is the first comprehensive corpus of Welsh designed to reflect language use across communication types, genres, speakers, language varieties (regional and social) and contexts. This article focuses on the computational infrastructure that we have designed to support data collection for CorCenCC and the subsequent uses of the corpus, which include lexicography, pedagogical research and corpus analysis. We adopted a grass-roots approach to design, adapting and extending previous corpus-building approaches and introducing new features as required for this specific context and language. The key pillars of the infrastructure are a framework that supports metadata collection, an innovative mobile application designed to collect spoken data (utilising a crowdsourcing approach), a backend database that stores curated data and a web-based interface that allows users to query the data online. A usability study was conducted to evaluate the user-facing tools and to suggest directions for future improvements. Although the infrastructure was developed for Welsh language collection, its design can be re-used to support corpus development in other minority or major language contexts, broadening the potential utility and impact of this work.
