Transcribing audio-video archives

This paper addresses the automatic transcription of audio-video archives using a state-of-the-art broadcast news speech transcription system. A 9-hour corpus spanning the latter half of the 20th century (1945–1995) was transcribed, and the quality of the transcriptions was analyzed. In addition to the challenges of transcribing heterogeneous broadcast news data, we are faced with properties of the archive that change over time, such as audio quality, speaking style, vocabulary, and manner of expression. After assessing the performance of the transcription system, we explore several ways to reduce the mismatch between the acoustic and language models and the archived data.