This paper presents the BNSI Slovenian Broadcast News database project. The result of the project is a database with speech and text corpora oriented toward general-domain large vocabulary continuous speech recognition. The speech corpus consists of 36 hours of transcribed evening and late-night news. The raw material was captured from the archive of the national broadcaster RTV Slovenia, which was a partner in the project. General Broadcast News transcription conventions were supplemented with language-specific rules. The Transcriber tool was used to produce the transcriptions. All additional tools needed during the annotation process were also installed on a computer. Statistics of the speech corpus are presented in the paper. The BNSI text corpus was generated from broadcast scenarios covering a period of 7 years and includes 600 monthly collections of show texts. These texts will be used to improve language modeling for the highly inflectional Slovenian language. The BNSI Slovenian Broadcast News database will be available through ELRA/ELDA.