AGH corpus of Polish speech

Abstract

A corpus of Polish speech, collected for use in automatic speech recognition (ASR) and text-to-speech (TTS) systems, is presented. The corpus consists of several groups of recordings: read sentences, spoken commands, a phonetically balanced TTS training corpus, telephonic speech and others. The total duration of the recordings exceeds 25 h. There are 166 unique speakers, most of them aged 20–35 and one third of them female. The frequency of unique words was analysed in relation to larger text resources, and the most common words were identified and are presented. The corpus was used as training data for the SARMATA ASR system. Cross-validation training and testing of SARMATA on our corpus gave a phrase recognition rate of 91.9 %. The corpus was additionally evaluated in a comparative test against the CORPORA corpus, which showed a major increase in phrase recognition rate in favour of our corpus.
