BNSI Slovenian broadcast news database - speech and text corpus

This paper presents the BNSI Slovenian Broadcast News database project. The result of the project is a database with a speech corpus and a text corpus, oriented toward large-vocabulary continuous speech recognition in the general domain. The speech corpus consists of 36 hours of transcribed evening and late-night news. The raw database material was captured from the archive of the national broadcaster RTV Slovenia, which was a partner in the project. General Broadcast News transcription conventions were supplemented with language-specific rules. The Transcriber tool was used to produce the transcriptions. All additional tools needed during the annotation process were also installed on a computer. Statistics of the speech corpus are presented in the paper. The BNSI text corpus was generated from broadcast scenarios covering a period of 7 years. It includes 600 monthly collections of show texts, which will be used to improve language modeling for the highly inflectional Slovenian language. The BNSI Slovenian Broadcast News database will be available through ELRA/ELDA.