Spoken Language Resources at LUKS of the University of Ljubljana

This paper presents the Slovene-language spoken resources that were acquired at the Laboratory of Artificial Perception, Systems and Cybernetics (LUKS) at the Faculty of Electrical Engineering, University of Ljubljana, over the past ten years. The resources consist of:

• isolated-spoken-word corpora designed for phonetic research of the Slovene spoken language;
• read-speech corpora from dialogues relating to air flight information;
• isolated-word corpora designed for studying the Slovene spoken diphthongs;
• Slovene diphone corpora used for text-to-speech synthesis systems;
• a weather forecast speech database, as an attempt to capture radio and television broadcast news in the Slovene language; and
• read- and spontaneous-speech corpora used to study the effects of the psychophysical conditions of the speakers on their speech characteristics.

All the resources are accompanied by relevant text transcriptions, lexicons and various segmentation labels. The read-speech corpora relating to the air flight information domain are also annotated prosodically and semantically. The words in the orthographic transcription were automatically tagged for their lemma and morphosyntactic description. Many of these speech resources are freely available for basic research purposes in speech technology and linguistics. In this paper we describe all the resources in more detail and give a brief description of their use in the spoken language technology products developed at LUKS.
