Large-scale processing, indexing and search system for Czech audio-visual cultural heritage archives

This paper describes a complex system developed for processing, indexing and accessing data collected in large audio and audio-visual archives that form an important part of Czech cultural heritage. The system is currently being applied to the Czech Radio archive, namely to its oral history segment, which holds more than 200,000 individual recordings covering almost ninety years of broadcasting in the Czech Republic and the former Czechoslovakia. The ultimate goals are a) to transcribe a significant portion of the archive with the support of speech, speaker and language recognition technology, b) to index the transcriptions, and c) to make the audio and text files fully searchable. So far, the system has processed and indexed over 75,000 spoken documents. Most of them come from the last two decades, but the current demo collection also includes a series of presidential speeches from 1934 onward. Full coverage of the archive should be available by the end of 2014.
