Compilation, transcription and usage of a reference speech corpus: the case of the Slovene corpus GOS

In recent years, building reference speech corpora has been an important part of the work of providing the necessary linguistic infrastructure in many European countries, both for languages with many speakers (e.g., French, German, Spanish, Italian) and for those with fewer speakers (e.g., Swedish, Dutch, Czech, Slovak). This paper describes how a reference speech corpus is created and distributed to potential users, taking the Slovene corpus GOS as a case study. It covers the corpus structure, fieldwork experience with recording, the labelling system, and the two levels of transcription (pronunciation-based and standardized), as well as the main characteristics of the corpus interface (a web concordancer) and the availability of the original corpus files.
